UNIT-1-Total-Data Visualization Techniques

CSE (AI and ML) (Osmania University)


Data Visualization Techniques:


Unit-I: Introduction: Basics, Relationship between Visualization and Other Fields, The Visualization Process, Pseudo Code Conventions, The Scatter Plot.

Basics:
What Is Visualization?
Visualization is the communication of information using graphical
representations. Pictures have been used as a mechanism for communication
since before the formalization of written language. A single picture can contain a
wealth of information, and can be processed much more quickly than a comparable
page of words.
This is because image interpretation is performed in parallel within the human
perceptual system, while the speed of text analysis is limited by the sequential
process of reading. Pictures can also be independent of local language, as a graph or
a map may be understood by a group of people with no common tongue.

History of Data Visualization:


The concept of using pictures to understand data dates back to the 17th century, when maps and graphs were first used for this purpose; in the early 1800s the idea was extended with the invention of the pie chart.
Several decades later, one of the most advanced examples of statistical graphics appeared when Charles Minard mapped Napoleon's invasion of Russia. The map depicts the size of the army along the path of Napoleon's retreat from Moscow, and ties that information to temperature and time scales for a more in-depth understanding of the event.
Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few years.

Importance of Data Visualization:


Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large amounts of complex data is much easier than poring over spreadsheets and reports.


Data visualization is a quick and easy way to convey concepts universally, and you can experiment with different scenarios by making slight adjustments.

Examples Of Data Visualization:-


In the early days of visualization, the most common visualization technique was
using a Microsoft Excel spreadsheet to transform the information into a table, bar
graph or pie chart. While these visualization methods are still commonly used, more
intricate techniques are now available, including the following:

Infographics
Bubble Clouds
Bullet Graphs
Heat Maps
Fever Charts
Time Series Charts

Some other popular techniques are as follows:

Line charts. This is one of the most basic and common techniques used. Line charts
display how variables can change over time.

Area charts. This visualization method is a variation of a line chart; it displays multiple
values in a time series -- or a sequence of data collected at consecutive, equally
spaced points in time.

Scatter plots. This technique displays the relationship between two variables. A
scatter plot takes the form of an x- and y-axis with dots to represent data points.
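
As a minimal illustration of these three chart types, the following matplotlib sketch (with made-up numbers purely for demonstration) draws a line chart, an area chart, and a scatter plot side by side:

    # Illustrative sketch: line, area, and scatter charts with synthetic data.
    import matplotlib.pyplot as plt

    months = [1, 2, 3, 4, 5, 6]
    sales  = [10, 14, 12, 18, 20, 17]
    costs  = [8, 9, 11, 10, 13, 12]

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

    ax1.plot(months, sales)              # line chart: change over time
    ax1.set_title("Line chart")

    ax2.fill_between(months, sales)      # area chart: line chart with filled area
    ax2.set_title("Area chart")

    ax3.scatter(costs, sales)            # scatter plot: relationship of two variables
    ax3.set_title("Scatter plot")

    plt.tight_layout()
    plt.show()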

Relationship between Visualization and Other Fields:-

What Is the Difference between Visualization and Computer Graphics?

Originally, visualization was considered a subfield of computer graphics, primarily because visualization uses graphics to display information via images.


As illustrated by any of the computer-generated images shown earlier, visualization applies graphical techniques to generate visual displays of data. Here, graphics is used as the communication medium.
In all visualizations, one can clearly see the use of the graphics primitives (points,
lines, areas, and volumes). Beyond the use of graphics, the most important aspect of
all visualizations is their connection to data.
Computer graphics focuses primarily on graphical objects and the organization of
graphic primitives; visualizations go one step further and are based on the underlying
data, and may include spatial positions, populations, or physical measures.
Consequently, visualization is the application of graphics to display data by mapping
data to graphical primitives and rendering the display.

However, visualization is more than simply computer graphics. The field of visualization encompasses aspects from numerous other disciplines, including human-computer interaction, perceptual psychology, databases, statistics, and data mining, to name a few.
While computer graphics can be used to define and generate the displays that
are used to communicate the information, the sources of data and the way users
interact and perceive the data are all important components to understand when
presenting information.


The Visualization Process:-


What is involved in the visualization process?
The designer of a new visualization most often begins with an analysis
of the type of data available for display and of the type of information the
viewer hopes to extract from or convey with the display. The data can come
from a wide variety of sources and may be simple or complex in structure.

Figure: The visualization process at a very high (primitive) level of view.

These user interface components are often input via dialog boxes, but
they could be visual representations of the data to facilitate the selections
required by the user. Visualizations can provide mechanisms for translating
data and tasks into more visual and intuitive formats for users to perform
their tasks.
This means that the data values themselves, or perhaps the attributes of
the data, are used to define graphical objects, such as points, lines, and
shapes; and their attributes, such as size, position, orientation, and color.

Thus, for example, a list of numbers can be plotted by mapping each number to the y-coordinate of a point and the number's index in the list to the x-coordinate.
Visualization in data exploration is used to convey information, discover
new knowledge, and identify structures, patterns, anomalies, trends, and
relationships.
The process of starting with data and generating an image, a visualization, or a model via the computer is traditionally described as a pipeline—a sequence of stages that can be studied independently in terms of algorithms, data structures, and coordinate systems.


There are three such processes or pipelines; all start with data and end with the user:

1. Computer Graphics
2. Visualization
3. Knowledge Discovery

The Computer Graphics Pipeline:-


Modeling. A three-dimensional model, consisting of planar polygons defined by vertices and surface properties, is generated using a world coordinate system.
Viewing. A virtual camera is defined at a location in world coordinates, along
with a direction and orientation (generally given as vectors). All vertices are
transformed into a viewing coordinate system based on the camera
parameters.
Clipping. By specifying the bounds of the desired image (usually given by
corner positions on a plane of projection placed in front of the camera),
objects out of view can be removed, and those that are partially visible can be
clipped. Objects may be transformed into normalized viewing coordinates to
simplify the clipping process. Clipping can actually be performed at many
different stages of the pipeline.
Hidden surface removal: Polygons facing away from the camera, or
those obscured by others, are removed or clipped. This process may be
integrated into the projection process.
Projection: Three-dimensional polygons are projected onto the two-dimensional plane of projection, usually using a perspective transformation. The results may be in a normalized 2D coordinate system or device/screen coordinates.
Rendering: The actual color of the pixels associated with a visible polygon depends on a number of factors, including the material properties being synthesized (base color, texture, surface roughness, shininess), the type(s), location(s), color, and intensity of the light source(s), the degree of occlusion from direct light exposure, and the amount and color of light being reflected off of other objects onto the polygon.

Figure: The graphics pipeline.
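
To make the projection stage concrete (a toy sketch only, not the full pipeline), the following Python function perspective-projects 3D points, assumed to already be expressed in viewing coordinates, onto an image plane at distance d in front of the camera; the function name and parameter values are illustrative:

    import numpy as np

    def perspective_project(points_3d, d=1.0):
        """Project Nx3 viewing-space points onto the image plane z = d."""
        pts = np.asarray(points_3d, dtype=float)
        x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
        # Similar triangles: x' = d * x / z and y' = d * y / z
        # (assumes z > 0, i.e., the points lie in front of the camera).
        return np.column_stack((d * x / z, d * y / z))

    # Two points at the same (x, y) but different depths: the farther one
    # projects closer to the image centre (perspective foreshortening).
    print(perspective_project([[1.0, 1.0, 2.0], [1.0, 1.0, 4.0]]))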

The Visualization Pipeline:-


The data/information visualization pipeline has some similarities to the
graphics pipeline, at least on an abstract level. The stages of this pipeline
are as follows:
Data modeling:- The data to be visualized, whether from a file or a database,
has to be structured to facilitate its visualization. The name, type, range, and
semantics of each attribute or field of a data record must be available in a
format that ensures rapid access and easy modification.
Data selection:- Similar to clipping, data selection involves identifying the
subset of the data that will be potentially visualized. This can occur totally
under user control or via algorithmic methods, such as cycling through time
slices or automatically detecting features of potential interest to the user.
Data to visual mappings:- The heart of the visualization pipeline is performing
the mapping of data values to graphical entities or their attributes. Thus, one
component of a data record may map to the size of an object, while others
might control the position or color of the object. This mapping often involves
processing the data prior to mapping, such as scaling, shifting, filtering,
interpolating.
Scene parameter setting (view transformations). As in traditional graphics, the user must specify several attributes of the visualization that are relatively independent of the data, such as the color map to be used. There are many variants, but all transform data into some internal representation within the computer and then use some visual paradigm to display the data on the screen.
Rendering or generation of the visualization: The specific projection or rendering of the visualization objects varies according to the mapping being used; techniques such as shading or texture mapping might be involved, although many visualization techniques only require drawing lines and uniformly shaded polygons.

The Knowledge Discovery Pipeline


The knowledge discovery (also called data mining) field has its own
pipeline. As with the graphics and visualization pipelines, we start with data; in
this case we process it with the goal of generating a model, rather than some
graphics display.
Note that the visualization pipeline can be overlaid on this knowledge
discovery (KD) pipeline. If we were to look at a pipeline for typical statistical
analysis procedures, we would find the same process structure:

Data:- In the KD Pipeline there is more focus on data, as the graphics and
visualization processes often assume that the data is already structured to
facilitate its display.

Data integration, cleaning, warehousing and selection:- These involve identifying the various data sets that will be potentially analyzed. Again, the user may participate in this step. This can involve filtering, sampling, subsetting, aggregating, and other techniques that help curate and manage the data for the data mining step.

Data mining:- The heart of the KD pipeline is algorithmically analyzing the data to produce a model.
Pattern evaluation: The resulting model or models must be evaluated to
determine their robustness, stability, precision, and accuracy.
Rendering or visualization: The specific results must be presented to the user.


It does not matter whether we think of this as part of the graphics or visualization pipelines; the fact is that a user will eventually need to see the results of the process. Model visualization is an exciting research area that will be discussed later.

Interactive visualization can be used at every step of the KD pipeline. One can think of this as computational steering.
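
The KD stages can be compressed into a short, illustrative Python sketch on a toy table; the column names and the choice of a linear regression model are assumptions made for the example, not prescribed by the pipeline:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Data (with a missing value to motivate the cleaning step).
    df = pd.DataFrame({"hours": [1, 2, 3, 4, None],
                       "score": [52, 58, 61, 70, 75]})

    # Integration / cleaning / selection: here we simply drop incomplete records.
    clean = df.dropna()

    # Data mining: fit a model rather than drawing a picture.
    model = LinearRegression().fit(clean[["hours"]], clean["score"])

    # Pattern evaluation: a crude check of the model's quality on the same data.
    print("R^2:", model.score(clean[["hours"]], clean["score"]))

    # Rendering / visualization: present the resulting model to the user.
    print(f"score = {model.coef_[0]:.1f} * hours + {model.intercept_:.1f}")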

The Role of Perception


In all visualizations, a critical aspect related to the user is the set of abilities and limitations of the human visual system. If the goal of visualization is to accurately convey information with pictures, it is essential that perceptual abilities be considered.

Pseudo Code Conventions:-


Throughout the text we include pseudo code wherever possible. In
our pseudo code, we aim to convey the essence of the algorithms at hand,
while leaving out details required for user interaction, graphics nuances, and
data management. We therefore assume that the following global variables
and functions exist in the environment of the pseudo code:

• Data—The working data table. This data table is assumed to contain only numeric values. In practice, dimensions of the original data table that contain non-numeric values must be somehow converted to numeric values. When visualizing a subset of the entire original data table, the working data table is assumed to be the subset.
• m—The number of dimensions (columns) in the working data table.
Dimensions are typically iterated over using j as the running dimension
index.
• n—The number of records (rows) in the working data table. Records
are typically iterated over using i as the running record index.


• Normalize(record, dimension), Normalize(record, dimension, min, max)—A function that maps the value for the given record and dimension in the working data table to a value between min and max, or between zero and one if min and max are not specified. The normalization is typically linear and local to a single dimension. For example, a scatterplot may require a linear normalization for the x-axis and a logarithmic normalization for the y-axis.

• Color(color)—A function that sets the color state of the graphics environment to the specified color (whose type is assumed to be an integer containing RGB values).

• MapColor(record, dimension)—A function that sets the color state of the graphics environment to be the color derived from applying the global color map to the normalized value of the given record and dimension in the working data table.

• Circle(x, y, radius)—A function that fills a circle centered at the given (x, y)-location, with the given radius, with the color of the color state of the graphics environment.
• Polyline(xs, ys)—A function that draws a polyline (many connected
line segments) from the given arrays of x and y coordinates.
• Polygon(xs, ys)—A function that fills the polygon defined by the
given arrays of x- and y-coordinates with the color of the current color
state.

For geographic visualizations, the following functions are assumed to exist in the environment:

• GetLatitudes(record), GetLongitudes(record)—Functions that retrieve the arrays of latitude and longitude coordinates, respectively, of the geographic polygon associated with the given record. For example, these polygons could be outlines of the countries of the world.
• ProjectLatitudes(lats, scale), ProjectLongitudes(longs, scale)—Functions that project arrays of latitude values to arrays of y values, and arrays of longitude values to arrays of x values, respectively.

For graph and 3D surface data sets, the following is provided:

• GetConnections(record)—A function that retrieves an array of record indices to which the given record is connected.

Arrays are indexed starting at zero.

The Scatter plot:


The scatter plot is one of the earliest and most widely used
visualizations developed. It is based on the Cartesian coordinate system.
This will give us some experience with transforming data into a visual
representation that is understood by most readers.

The following pseudo code renders a scatter plot of circles. Records are
represented in the Scatter plot as circles of varying location, color, and size.
The x- and y-axes represent data from dimension numbers xDim and
yDim, respectively. The color of the circles is derived from dimension number
cDim.
The radius of the circles is derived from dimension number rDim, as
well as from the upper and lower bounds for the radius, rMin and rMax.

Scatterplot(xDim, yDim, cDim, rDim, rMin, rMax)

    for each record i                          // For each record,
        x ← Normalize(i, xDim)                 //   derive the location,
        y ← Normalize(i, yDim)
        r ← Normalize(i, rDim, rMin, rMax)     //   radius,
        MapColor(i, cDim)                      //   and color, then
        Circle(x, y, r)                        //   draw the record as a circle.
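
To make the algorithm concrete, here is a rough Python/matplotlib sketch of the same idea. The working data table is assumed to be a NumPy array with records as rows and dimensions as columns, and the normalize helper stands in for the Normalize function from the conventions above; matplotlib lets us draw all circles in a single call rather than looping over records:

    import numpy as np
    import matplotlib.pyplot as plt

    def normalize(data, dim, lo=0.0, hi=1.0):
        """Linearly map one column of the working data table into [lo, hi]."""
        col = data[:, dim].astype(float)
        span = col.max() - col.min()
        t = (col - col.min()) / span if span else np.zeros_like(col)
        return lo + t * (hi - lo)

    def scatterplot(data, x_dim, y_dim, c_dim, r_dim, r_min, r_max):
        x = normalize(data, x_dim)                  # derive the location,
        y = normalize(data, y_dim)
        r = normalize(data, r_dim, r_min, r_max)    # radius,
        c = normalize(data, c_dim)                  # and color, then
        plt.scatter(x, y, s=1000 * r ** 2, c=c)     # draw records as circles
        plt.show()                                  # (s is marker area in points^2)

    # Example: four records with four numeric dimensions each (made-up values).
    data = np.array([[1, 40, 3, 0.2], [2, 50, 1, 0.5],
                     [3, 60, 2, 0.3], [4, 55, 4, 0.9]])
    scatterplot(data, x_dim=0, y_dim=1, c_dim=2, r_dim=3, r_min=0.05, r_max=0.25)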

Example: student marks, with Std Id on the x-axis and Marks on the y-axis.

Std Id    Marks
101       40
102       50
103       60
104       55
105       55
106       70
107       60
108       50

Figure: Scatter plot of Student Marks, with Student Id on the x-axis (100-112) and Marks on the y-axis (0-80).


Unit-I: Data Foundation: Types of Data, Structure within and between Records, Data Preprocessing, Data Sets.

Data foundation:
Data comes from many sources; it can be gathered from sensors or surveys, or
it can be generated by simulations and computations.
Data can be raw (untreated), or it can be derived from raw data via some
process, such as smoothing, noise removal, scaling, or interpolation. It also can have a
wide range of characteristics and structures.
A typical data set used in visualization consists of a list of n records, (r1, r2, ..., rn). Each record ri consists of m (one or more) observations or variables, (v1, v2, ..., vm).
A variable may be classified as either independent or dependent.

An independent variable ivi: its value is not controlled or affected by another variable, such as the time variable in a time-series data set.
A dependent variable dvj: its value is affected by variation in one or more associated independent variables.

A record can thus be written as ri = (iv1, iv2, ..., iv_mi, dv1, dv2, ..., dv_md), where mi is the number of independent variables and md is the number of dependent variables, with m = mi + md.

Types of Data:
In its simplest form, each observation or variable of a data record represents a single piece of information. We can categorize this information as being ordinal (numeric) or nominal (nonnumeric). Subcategories of each can be readily defined.

1. Ordinal:- The data take on numeric values:

 Binary—assuming only the values 0 and 1;
 Discrete—taking on only integer values, or values from a specific subset (e.g., {2, 4, 6});
 Continuous—representing real values (e.g., in the interval [0, 5]).

2. Nominal:- The data take on nonnumeric values:


 Categorical—a value selected from a finite (often short) list of possibilities (e.g.,
red, blue, green);
 Ranked—a categorical variable that has an implied ordering (e.g., small,
medium, large);


 Arbitrary—a variable with a potentially infinite range of values with no implied ordering (e.g., addresses).

Another method of categorizing variables is by using the mathematical concept of scale.

Scale:- Three attributes that define a variable’s measure are as follows:

 Ordering relation, with which the data can be ordered in some fashion. By
definition, ranked nominal variables and all ordinal variables exhibit this relation.
 Distance metric, with which the distances can be computed between different
records. This measure is clearly present in all ordinal variables, but is generally
not found in nominal variables.

 Existence of absolute zero, in which variables may have a fixed lowest value.
This is useful for differentiating types of ordinal variables. A variable such as
weight possesses an absolute zero, while bank balance does not. A variable
possesses an absolute zero if it makes sense to apply all four mathematical
operations (+, −, ×, ÷) to it [129].

Structure within and between Records:

Data sets have structure, both in terms of the means of representation (syntax) and the types of interrelationships within a given record and between records (semantics). The data records are considered in the following ways:
 Scalars,
 Vectors
 Tensors

Scalar:- An individual number in a data record is often referred to as a scalar. Scalar values, such as the cost of an item or the age of an individual, are often the focus of analysis and visualization.

Vector:- Multiple variables within a single record can represent a composite data item.
For example, a point in a two-dimensional flow field might be represented by a pair of
values, such as a displacement in x and y. This pair, and any such composition, is
referred to as a vector.
While each component of a vector might be examined individually, it is most
common to treat the vector as a whole.


Tensor:- A tensor is defined by its rank and by the dimensionality of the space within which it is defined. It is generally represented as an array or matrix. Scalars and vectors are simple variants of a more general structure known as a tensor. A scalar is a tensor of rank 0, while a vector is a tensor of rank 1.

A 3 × 3 matrix represents a tensor of rank 2 in 3D space, and in general, a tensor of rank M in D-dimensional space requires D^M data values.
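
A small NumPy illustration of these ranks (with made-up values):

    import numpy as np

    scalar = np.array(3.7)             # rank 0: a single number
    vector = np.array([0.5, -1.2])     # rank 1: e.g., a displacement in 2D (2 values)
    tensor = np.eye(3)                 # rank 2 in 3D space: a 3x3 matrix (9 values)

    # A tensor of rank M in D-dimensional space holds D**M values.
    for t in (scalar, vector, tensor):
        print("rank", t.ndim, "->", t.size, "value(s)")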

Data in Geometry and Grids form: Geometric structure can commonly be found in
data sets, especially those from scientific and engineering domains. The simplest
method of incorporating geometric structure in a data set is to have explicit coordinates
for each data record.
It is assumed that some form of grid exists, and the data set is structured such
that successive data records are located at successive locations on the grid.
It would be sufficient to indicate a starting location, an orientation, and the step size horizontally and vertically. There are many different coordinate systems used for grid-structured data, including Cartesian, spherical, and hyperbolic coordinates.

Data Preprocessing:-

Data preprocessing is a Data Mining method that entails converting raw data into
a format that can be understood. Real-world data is frequently inadequate, inconsistent,
and/or lacking in specific activities or trends, as well as including numerous inaccuracies.

This might result in low-quality data collection and, as a result, low-quality models
based on that data. Preprocessing data is a method of resolving such problems.
Machines do not comprehend free text, image, or video data; instead, they comprehend
1s and 0s.
Data Preprocessing is the step in any Machine Learning process in which the
data is changed, or encoded, to make it easier for the machine to parse it. In other
words, the algorithm can now easily interpret the data’s features.

Data preprocessing can be divided into four categories:

1. Data Cleaning/Cleansing,
2. Data Integration,
3. Data Transformation,
4. Data Reduction.


1. Data Cleaning:-
Data in the real world is frequently incomplete, noisy, and inconsistent. Many bits
of the data may be irrelevant or missing. Data cleaning is carried out to handle this
aspect. Data cleaning methods aim to fill in missing values, smooth out noise while
identifying outliers, and fix data discrepancies. Unclean data can confuse both the analysis and the model. Therefore, running the data through various data cleaning/cleansing methods is an important data preprocessing step.

(a) Missing Data :


It’s fairly common for your dataset to contain missing values. It could have
happened during data collection or as a result of a data validation rule, but missing
values must be considered anyway.
I. Dropping rows/columns: If a complete row (or column) contains only NaN values, it adds nothing and can be dropped immediately. Likewise, if most of a row or column is missing, say more than 65%, one can choose to drop it.
II. Checking for duplicates: If the same row or column is repeated, drop the duplicates and keep only the first instance, so that the repeated data object does not gain an undue advantage or bias when machine learning algorithms are run.
III. Estimate missing values: If only a small percentage of the values are missing, basic interpolation methods can be used to fill in the gaps. However, the most typical approach to dealing with missing data is to fill the gaps with the feature's mean, median, or mode value (see the sketch after this list).
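
A minimal pandas sketch of these three strategies; the column names and values are made up for illustration:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age":    [25, np.nan, 31, 31],
                       "salary": [50, 54, np.nan, np.nan]})

    df = df.drop_duplicates()                                # II. drop repeats, keep first
    df = df.dropna(how="all")                                # I.  drop rows that are all NaN
    df["age"] = df["age"].fillna(df["age"].median())         # III. estimate missing values
    df["salary"] = df["salary"].fillna(df["salary"].mean())  #      with the median / mean
    print(df)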

(b) Noisy Data:


Noisy data is meaningless data that machines cannot interpret. It can be caused by poor data collection, data entry problems, and so on. It can be dealt with in the following ways:
i. Binning Method: This method smooths data that has been sorted. The data is
divided into equal-sized parts, and the process is completed using a variety of
approaches. Each segment is dealt with independently. All data in a segment can
be replaced by its mean, or boundary values can be used to complete the task.
ii. Clustering: In this method, related data is grouped in a cluster. Outliers may go
unnoticed, or they may fall outside of clusters.
iii. Regression: By fitting data to a regression function, data can be smoothed out. The regression model employed may be linear (having a single independent variable) or multiple (having several independent variables).
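
As a rough illustration of the binning method, the following pandas sketch uses equal-frequency bins (via qcut, an assumption made for the example) and smooths each bin by replacing its values with the bin mean:

    import pandas as pd

    prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]).sort_values()
    bins = pd.qcut(prices, q=4)                        # four equal-frequency bins
    smoothed = prices.groupby(bins).transform("mean")  # smooth by bin means
    print(smoothed.tolist())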

2. Data Integration:-

Figure: Data Source-1, Data Source-2, ..., Data Source-n feed into Data Integration, which produces a unified view.

Data integration is the data analysis task of combining data from multiple sources into a coherent data store. These sources may include multiple databases. How can the data be matched up? A data analyst may find Customer_ID in one database and cust_id in another; how can they be sure that these two fields refer to the same entity? Databases and data warehouses carry metadata (data about the data), which helps in matching such attributes and avoiding errors.
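
A tiny pandas sketch of this matching step, with illustrative table and column names:

    import pandas as pd

    orders    = pd.DataFrame({"Customer_ID": [1, 2], "amount": [250, 90]})
    customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})

    # Metadata tells us Customer_ID and cust_id describe the same entity,
    # so the two sources can be merged into one unified view.
    unified = orders.merge(customers, left_on="Customer_ID", right_on="cust_id")
    print(unified[["Customer_ID", "name", "amount"]])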

3. Data Transformation:-
This stage is used to convert the data into a format that can be used in the mining
process. This is done in the following ways:
1. Normalization: It is done to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0)
2. Concept Hierarchy Generation: Using concept hierarchies, low-level or primitive/raw data is substituted with higher-level concepts in data generalization. Categorical attributes, for example, are generalized from low-level concepts such as street to higher-level concepts such as city or country. Similarly, values of a numeric attribute such as age can be mapped to higher-level concepts such as youth, middle-aged, or senior.
3. Smoothing: Smoothing works to remove the noise from the data. Such
techniques include binning, clustering, and regression.
4. Aggregation: Aggregation is the process of applying summary or aggregation
operations on data. Daily sales data, for example, might be combined to calculate
monthly and annual totals.
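
A short Python sketch of two of these transformations, min-max normalization into the range 0.0 to 1.0 and aggregation of daily sales into monthly totals (the data is synthetic):

    import pandas as pd

    daily = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=90, freq="D"),
                          "sales": range(90)})

    # 1. Normalization: rescale the sales values into [0.0, 1.0].
    s = daily["sales"]
    daily["sales_norm"] = (s - s.min()) / (s.max() - s.min())

    # 4. Aggregation: combine the daily sales figures into monthly totals.
    monthly = daily.resample("MS", on="date")["sales"].sum()
    print(monthly)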

4. Data Reduction:
Data mining deals with very large amounts of data, and analysis becomes more difficult as the volume grows. Data reduction techniques are employed to address this. Their goal is to improve storage efficiency while lowering data storage and analysis costs.

1. Dimensionality Reduction: A huge number of features may be found in most real-world datasets. Consider an image processing problem: there could be hundreds of features, also known as dimensions, to deal with. As the name suggests, dimensionality reduction seeks to reduce the number of features, and not merely by selecting a sample of features from the existing feature set; that is a different technique entirely, known as feature subset selection or feature selection.

2. Numerosity Reduction:
Data is replaced or estimated using alternative and smaller data representations
such as parametric models (which store only the model parameters rather than the
actual data, such as Regression and Log-Linear Models) or non-parametric
approaches (e.g. Clustering, Sampling, and the use of histograms).
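
As one illustration, Principal Component Analysis (PCA) is a commonly used dimensionality reduction technique; the scikit-learn sketch below (with random synthetic data) projects a 4-feature dataset down to 2 derived features:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(100, 4))  # 100 records, 4 features
    reduced = PCA(n_components=2).fit_transform(X)       # keep 2 derived components
    print(X.shape, "->", reduced.shape)                  # (100, 4) -> (100, 2)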

The Preprocessing of Text Data And Image Data:


If the data contains text or images, the preprocessing is a little different.

Preprocessing of Text Data:


Preprocessing the text data is a very important step when dealing with text, because the text must eventually be converted into features to feed into the model. The objective of preprocessing text data is to remove the characters and words that carry no value for us: punctuation, stop words, URLs, HTML codes, spelling mistakes, and so on.

Steps to perform for text pre-processing


 Read the text — read the text data and store it in a variable.
 Store in a list — using df.tolist(), store the sentences in a list.
 Initialize the Preprocess object and pass the techniques to apply.
 Iterate through the list to get the processed text.

After reading the text data, we apply the Preprocess object provided by the preprocessing module.
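
The Preprocess object above belongs to the original article's own helper library; a hand-rolled sketch of comparable cleaning steps (lower-casing and removing URLs, HTML tags, and punctuation) using only the Python standard library might look like this:

    import re
    import string

    def preprocess(text):
        text = text.lower()
        text = re.sub(r"https?://\S+", " ", text)    # strip URLs
        text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
        text = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(text.split())                # collapse whitespace

    sentences = ["Visit <b>our</b> site at https://example.com!!"]
    print([preprocess(s) for s in sentences])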

Preprocessing of Image data:


The term “image pre-processing” refers to operations on images at the most basic level. If entropy (degree of randomness) is taken as the information metric, these methods do not increase image information content; they actually decrease it. Pre-processing aims to improve image data by suppressing unwanted distortions or enhancing particular visual properties that are important for subsequent processing and analysis.


Steps to perform for image pre-processing


 Read image — Read the images
 Resize image — Resize the images because the image size captured and fed to the
model is different. So it is good to establish a base size and resize the images
 Remove noise (Denoise) — Using Gaussian blur inside the function processing() we
can smooth the image to remove unwanted noise.

 Segmentation & Morphology (smoothing edges) — We will segment the image in this stage, separating the background from foreground objects, and then refine the segmentation with more noise removal.
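
A short OpenCV sketch of these steps; the file name input.jpg and the chosen sizes and thresholds are illustrative assumptions, and Otsu thresholding followed by a morphological opening stands in for the segmentation and morphology step:

    import cv2
    import numpy as np

    img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # read the image
    img = cv2.resize(img, (256, 256))                     # resize to a base size
    img = cv2.GaussianBlur(img, (5, 5), 0)                # denoise with Gaussian blur

    # Segmentation: separate foreground from background with Otsu thresholding,
    # then refine the mask with a morphological opening to smooth the edges.
    _, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    cv2.imwrite("preprocessed.jpg", mask)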

There are 4 different types of Image Pre-Processing techniques and they are listed
below.
1. Pixel brightness transformations/ Brightness corrections
2. Geometric Transformations
3. Image Filtering and Segmentation
4. Fourier transform and image restoration
Here, we will discuss pixel brightness transformations:

Pixel brightness transformations :


The most common pixel brightness transform operations are:
1. Gamma correction (Power Law Transform)
2. Histogram equalization
3. Sigmoid stretching
The basic linear transform is g(x) = α·f(x) + β, where alpha and beta control the contrast and brightness of the image.
1. Gamma correction — a non-linear adjustment to individual pixel values.
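
A small NumPy sketch of both transforms on a tiny synthetic image (the alpha, beta, and gamma values are illustrative):

    import numpy as np

    f = np.array([[0, 64], [128, 255]], dtype=float)  # a tiny 2x2 "image"

    # Linear brightness/contrast adjustment: g(x) = alpha * f(x) + beta.
    alpha, beta = 1.2, 10
    g = np.clip(alpha * f + beta, 0, 255)

    # Gamma correction: normalize to [0, 1], raise to the power 1/gamma, rescale.
    gamma = 2.2
    corrected = 255 * (f / 255) ** (1 / gamma)

    print(g.round())
    print(corrected.round())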

Data Sets :-


There are three general characteristics of data sets, namely: Dimensionality, Sparsity, and Resolution.

Dimensionality:-
The dimensionality of a data set is the number of attributes that the objects in the
data set have. If a particular data set has a high number of attributes (also called high dimensionality), it can become difficult to analyse such a data set.

Sparsity
For some data sets, such as those with asymmetric features, most attributes of an
object have values of 0; in many cases fewer than 1% of the entries are non-zero. Such data is called sparse data, or the data set is said to exhibit sparsity.

Resolution
The patterns in the data depend on the level of resolution. If the resolution is too fine, a
pattern may not be visible or may be buried in noise; if the resolution is too coarse, the
pattern may disappear. For example, variations in atmospheric pressure on a scale of
hours reflect the movement of storms and other weather systems. On a scale of months,
such phenomena are not detectable.

Data Sets are classified into three categories namely,


1. Record Data,
2. Graph-based Data
3. Ordered Data.

Record Data:

The most basic form of record data has no explicit relationship among records or data
fields, and every record (object) has the same set of attributes. Record data is usually
stored either in flat files or in relational databases.

There are a few variations of Record Data, which have some characteristic properties.
1. Transaction or Market Basket Data: It is a special type of record data, in which
each record contains a set of items. For example, shopping in a supermarket or a
grocery store. For any particular customer, a record will contain a set of items
purchased by the customer in that respective visit to the supermarket or the grocery
store. This type of data is called Market Basket Data. Transaction data is a
collection of sets of items, but it can be viewed as a set of records whose fields are
asymmetric attributes. Most often, the attributes are binary, indicating whether or not an item was purchased.


2. The Data Matrix: If the data objects in a collection of data all have the same fixed
set of numeric attributes, then the data objects can be thought of as points
(vectors) in a multidimensional space, where each dimension represents a distinct attribute describing the object. A set of such data objects can be interpreted as an m × n matrix, where there are m rows, one for each object, and n columns, one for each attribute. Standard matrix operations can be applied to transform and manipulate the data. Therefore, the data matrix is the standard data format for most statistical data.

3. The Sparse Data Matrix: A sparse data matrix (sometimes also called a document-data matrix) is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only the non-zero values are important.

Graph-based Data

This can be further divided into types:


1. Data with Relationships among Objects: The data objects are mapped to nodes
of the graph, while the relationships among objects are captured by the links
between objects and link properties, such as direction and weight. Consider Web
pages on the World Wide Web, which contain both text and links to other pages. In
order to process search queries, Web search engines collect and process Web
pages to extract their contents.

2. Data with Objects That Are Graphs: If objects have structure, that is, the objects
contain sub objects that have relationships, then such objects are frequently
represented as graphs. For example, the structure of chemical compounds can be
represented by a graph, where the nodes are atoms and the links between nodes
are chemical bonds.

Ordered Data

For some types of data, the attributes have relationships that involve order in time or
space. It can be segregated into four types:
1. Sequential Data: Also referred to as temporal data, sequential data can be thought of as an extension of record data, where each record has a time associated with it. Consider a retail transaction data set that also stores the time at which each transaction took place.
2. Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
3. Time Series Data: Time series data is a special type of sequential data in which
each record is a time series, i.e., a series of measurements taken over time. For
example, a financial data set might contain objects that are time series of the daily
prices of various stocks.
4. Spatial Data: Some objects have spatial attributes, such as positions or areas, as
well as other types of attributes. An example of spatial data is weather data
(precipitation, temperature, pressure) that is collected for a variety of geographical
locations.

