Module-2

SRI RAMAKRISHNA ENGINEERING COLLEGE
[Educational Service : SNR Sons Charitable Trust]
[Autonomous Institution, Reaccredited by NAAC with ‘A+’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all Eligible Programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022.

Department of Information Technology

20IT211- Data Science

Presentation by
Mrs.S.Jansi Rani, AP(Sr.Gr)/IT
COURSE OUTCOMES
20IT211- Data Science
CO1: Understand the basic concepts of data science and data mining (PO1, PO2, PO12)

CO2: Identify the techniques to explore and evaluate data (PO3, PO5, PO12)

CO3: Apply various data mining algorithms for real time applications (PO2, PO3, PO5, PO12)

CO4: Implement the concepts of clustering and model evaluation (PO3, PO5, PO12)

19/01/25 20IT211- DATA SCIENCE - MRS.S.JANSI RANI, AP(SR.GR)/IT 2


20IT211- Data Science

Module I : INTRODUCTION (9 hours)

What is data science – Case for data science – Data science classification – Data science algorithms – Data science process – Prior knowledge – Data preprocessing – Data cleaning – Data integration – Data reduction – Data transformation and data discretization – Feature selection – Data sampling – Modeling – Application.

Module II : DATA EXPLORATION AND VISUALIZATION (9 hours)

Objectives of Data exploration – Datasets – Descriptive statistics – Data Visualization – Univariate visualization – Multivariate visualization – Visualizing high dimensional data – Roadmap for data exploration.

Module III : CLASSIFICATION AND ASSOCIATION ANALYSIS (18 hours)

Basic concepts of Classification – Decision tree induction – Bayes classification methods – Rule based classification – Techniques to improve classification accuracy – Support vector machines – Regression methods: Linear regression – Logistic regression – Association analysis: Frequent Item set mining methods – Pattern evaluation methods.

Module IV : CLUSTERING AND MODEL EVALUATION (9 hours)

Basic concepts and methods in cluster analysis – Partitioning methods – Density based methods – Model evaluation: Confusion matrix – Receiver Operating Characteristic (ROC) and Area under the Curve (AUC) – Lift curves – Evaluating the Predictions – Implementation


TEXTBOOKS
1. Vijay Kotu and Bala Deshpande, “Data Science: Concepts and Practice”, 2nd Edition, Morgan Kaufmann Publishers, 2019.

2. Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, 3rd Edition, Morgan Kaufmann Publishers, 2012.

3. Cathy O’Neil and Rachel Schutt, “Doing Data Science: Straight Talk from the Frontline”, O’Reilly, 2016.


Reference(s)
1. Mohammed J. Zaki and Wagner Meira Jr., “Data Mining and Analysis: Fundamental Concepts and Algorithms”, Cambridge University Press, 2014.

2. Matt Harrison, “Learning the Pandas Library: Python Tools for Data Munging, Analysis and Visualization”, O’Reilly, 2016.

3. Joel Grus, “Data Science from Scratch: First Principles with Python”, O’Reilly Media, 2015.

4. Wes McKinney, “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython”, O’Reilly Media, 2012.


WEB REFERENCES
1. https://nptel.ac.in/courses/106/106/106106179/


Acknowledgement
Resources are taken from the internet and textbooks


Introduction to Data Exploration
Data science helps decipher the hidden, useful relationships within data.

Data exploration helps to:

understand the data better,

prepare the data in a way that makes advanced analysis possible, and

sometimes get the necessary insights from the data faster than advanced analytical techniques would.


Example:

Simple pivot table functions, computing statistics such as mean and deviation, and plotting data as line, bar, and scatter charts are all data exploration techniques.


Data Exploration
Data exploration techniques fall into two types:
Descriptive Statistics
Data Visualization

Visualization is the process of projecting the data, or parts of it, into multi-dimensional space or abstract images. All the useful (and adorable) charts fall under this category.

Descriptive statistics is the process of condensing key characteristics of the dataset into simple numeric metrics. Some of the common quantitative metrics used are mean, standard deviation, and correlation.


OBJECTIVES OF DATA EXPLORATION
Data understanding

Data preparation

Data science tasks

Interpreting the results


Dataset
Iris:
* The Iris dataset contains 150 observations of three different species, Iris setosa, Iris virginica, and Iris versicolor, with 50 observations each.

* Each observation consists of four attributes: sepal length, sepal width, petal length, and petal width.


Dataset: Types of Data
For example, the temperature in weather data can be expressed in any of the following formats:

● Numeric: centigrade (31°C, 33.3°C), Fahrenheit (100°F, 101.45°F), or the Kelvin scale

● Ordered labels, as in hot, mild, or cold

● Number of days within a year below 0°C (10 days in a year below freezing)


Numeric or Continuous:

Temperature expressed in Centigrade or Fahrenheit is numeric and continuous: it is denoted by numbers and can take an infinite number of values between any two digits.

Values are ordered, and the difference between two values can be calculated.

Additive and subtractive mathematical operations, and logical comparison operators such as greater than, less than, and equal to, can be applied.

An integer is a special form of the numeric data type that has no decimals in the value or, more precisely, does not have infinite values between consecutive numbers.

Categorical or Nominal:

Nominal attributes are also referred to as categorical attributes. The values of nominal attributes do not have any meaningful order.

The color of the iris of the human eye is a categorical data type because it takes a value like black, green, blue, or gray.

There is no direct relationship among the data values, and hence mathematical operators other than the logical “is equal” operator cannot be applied.

They are also called a nominal or polynominal data type.


An ordered nominal data type is a special case of a categorical data type where there is some kind of order among the values.

An example of an ordered data type is temperature expressed as hot, mild, or cold.


DESCRIPTIVE STATISTICS
Descriptive statistics is the branch of statistics that involves summarizing, organizing, and presenting data meaningfully and concisely.

Some examples of descriptive statistics include average annual income, median home price in a neighborhood, and range of credit scores of a population.

Depending on the number of attributes under analysis, descriptive statistics can be broadly classified into
univariate exploration and
multivariate exploration.


Univariate Exploration
Univariate data exploration denotes analysis of one attribute at a time.

Measure of Central Tendency

Mean: The mean is the arithmetic average of all observations in the dataset. It is calculated by summing all the data points and dividing by the number of data points.

Median: The median is the value of the central point in the distribution. It is calculated by sorting all the observations from small to large and selecting the mid-point observation in the sorted list. If the number of observations is even, the median is the average of the two middle observations.

Mode: The mode is the most frequently occurring observation. Data points in a dataset may repeat, and the most frequently repeated data point is the mode of the dataset.
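
The three measures of central tendency can be sketched with Python's standard `statistics` module. The sample values below are hypothetical sepal lengths chosen only for illustration; they are not taken from the slides.

```python
import statistics

# Hypothetical sample: sepal lengths (cm) of seven flowers.
sepal_lengths = [5.1, 4.9, 4.7, 5.0, 5.1, 5.4, 4.6]

mean_value = statistics.mean(sepal_lengths)      # sum of values / number of values
median_value = statistics.median(sepal_lengths)  # mid-point of the sorted list
mode_value = statistics.mode(sepal_lengths)      # most frequently occurring value

print(f"mean={mean_value:.3f}, median={median_value}, mode={mode_value}")
```

With seven observations the median is the fourth value of the sorted list, and 5.1 is the only value that occurs twice, so it is the mode.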

Measure of Spread
There are two common metrics to quantify spread:

Range: The range is the difference between the maximum value and the minimum value of the attribute.

Deviation: The variance and standard deviation measure the spread by considering all the values of the attribute.

Deviation is simply the difference between any given value (xi) and the mean of the sample (μ).

The variance is the sum of the squared deviations of all data points divided by the number of data points. The standard deviation is the square root of the variance.

High standard deviation means the data points are spread widely around the central point.

Low standard deviation means data points are closer to the central point.
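
As a minimal sketch of these definitions, the snippet below computes the range, the deviations from the mean, the variance (dividing by the number of data points, as stated above) and the standard deviation. The observations are hypothetical.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

value_range = max(values) - min(values)        # maximum minus minimum
mu = sum(values) / len(values)                 # mean of the sample
deviations = [x - mu for x in values]          # xi - mu for every data point
variance = sum(d ** 2 for d in deviations) / len(values)
std_dev = math.sqrt(variance)

print(f"range={value_range}, variance={variance}, std dev={std_dev}")

# The standard library's population variance uses the same divide-by-N definition.
print(statistics.pvariance(values) == variance)
```

Note that `statistics.variance` (without the `p`) divides by N - 1 instead, which is the usual choice when the data is a sample from a larger population.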

Percentile & Quartiles

The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest.

Interquartile range (IQR) = Q3 - Q1. For example, with Q1 = 41 and Q3 = 85, IQR = 85 - 41 = 44.


A Formula for Finding the kth Percentile
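
The formula on this slide did not survive extraction, so the sketch below uses one common textbook convention as an assumption: with the data sorted and n values, the position of the kth percentile is i = k(n + 1)/100; if i is a whole number the percentile is the ith value, otherwise it is the average of the values on either side. Other conventions exist and give slightly different answers. The function name `kth_percentile` and the sample data are hypothetical.

```python
import math

def kth_percentile(values, k):
    """kth percentile using the position i = k * (n + 1) / 100 (assumed convention)."""
    data = sorted(values)                  # percentiles require ordered data
    n = len(data)
    i = k * (n + 1) / 100                  # 1-based position in the sorted list
    lower = min(max(math.floor(i), 1), n)  # clamp positions to 1..n
    upper = min(max(math.ceil(i), 1), n)
    # When i is a whole number, lower == upper and this returns that value.
    return (data[lower - 1] + data[upper - 1]) / 2

sample = [3, 5, 7, 8, 9, 11, 13, 15, 16]   # hypothetical data, n = 9

for k in (25, 50, 75, 90):
    print(k, kth_percentile(sample, k))
```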



Skewness
Normal distribution:

The peak of the curve is at the mean, and the data is symmetrically distributed on either side of it. The mean, median, and mode are equal to each other.

Skewness measures the level of asymmetry in a distribution, that is, how far the data deviates from a symmetric shape.

Sometimes a distribution tilts more to one side: the values on one side of the mean are spread out over a longer tail than those on the other side, which makes the distribution asymmetrical. The data is then not equally distributed around the mean.


Two Types:

1. Positively Skewed: In a positively skewed distribution, the values are more concentrated towards the left side, and the right tail is spread out.

The skewness value is positive (the mean, median, and mode themselves need not be positive), and typically:

Mean > Median > Mode


2. Negatively Skewed: In a negatively skewed distribution, the data points are more concentrated towards the right-hand side of the distribution, and the left tail is spread out. The long left tail pulls the mean below the median and the mode, so the skewness value is negative (the data values themselves need not be negative), and typically:

Mode > Median > Mean
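
The typical ordering of mean, median, and mode for the two types of skew can be checked on small examples. Both datasets below are hypothetical: the first has a long right tail, the second is its mirror image with a long left tail. The helper name `central_values` is introduced here only for illustration.

```python
import statistics

right_tailed = [1, 1, 1, 2, 2, 3, 10]     # positively skewed (long right tail)
left_tailed = [10, 10, 10, 9, 9, 8, 1]    # negatively skewed (long left tail)

def central_values(data):
    """Return (mean, median, mode) of the data."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

mean_r, median_r, mode_r = central_values(right_tailed)
mean_l, median_l, mode_l = central_values(left_tailed)

print("positive skew:", mean_r > median_r > mode_r)   # Mean > Median > Mode
print("negative skew:", mode_l > median_l > mean_l)   # Mode > Median > Mean
```

These orderings are a rule of thumb for unimodal data rather than a guarantee; unusual distributions can violate them.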

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, zero, or undefined.

In other words, skewness measures how much the probability distribution of a random variable deviates from a symmetric distribution such as the normal distribution.


If the skewness value is:

1. between -0.5 and 0.5, the distribution is almost symmetrical.

2. between -1 and -0.5, the data is negatively skewed, and between 0.5 and 1, the data is positively skewed. In both cases the skewness is moderate.

3. lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed.
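
The thresholds above can be applied in code. The slides do not say which skewness formula is intended, so this sketch assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment). The function names `skewness` and `describe_skew` and the datasets are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

def describe_skew(g):
    """Map a skewness value onto the rule-of-thumb bands listed above."""
    if -0.5 <= g <= 0.5:
        return "approximately symmetric"
    direction = "positively" if g > 0 else "negatively"
    strength = "moderately" if abs(g) <= 1 else "highly"
    return f"{strength} {direction} skewed"

for data in ([1, 2, 3, 4, 5], [1, 1, 1, 2, 2, 3, 10], [10, 10, 10, 9, 9, 8, 1]):
    g = skewness(data)
    print(data, round(g, 3), describe_skew(g))
```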


Example of Skewness
Cricket scores are a good example of a skewed distribution.

Suppose that during a match most of the players of a particular team scored above 50 runs, and only a few scored below 10. Such data is generally represented by a negatively skewed distribution.

Similarly, a positively skewed distribution applies if most of the players of a particular team score badly during a match, and only a few perform well.


Kurtosis

Multivariate Exploration
Multivariate exploration is the study of more
than one attribute in the dataset simultaneously.

Central Data Point

Correlation


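The Pearson correlation coefficient between two attributes x and y is the covariance of x and y divided by the product of their standard deviations:

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} $$

where s_x and s_y are the standard deviations of the random variables x and y. The coefficient ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship).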
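
A minimal sketch of the Pearson correlation computed directly from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the petal measurements are hypothetical illustration data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: cov(x, y) / (s_x * s_y), using divide-by-n moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)

# Hypothetical petal length and petal width measurements (cm).
petal_length = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]
petal_width = [0.2, 0.2, 1.4, 1.5, 2.5, 1.9]

print(round(pearson_r(petal_length, petal_width), 3))
```

Because the same divisor is used for the covariance and both standard deviations, it cancels out, so dividing by n or by n - 1 gives the same coefficient.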


DATA VISUALIZATION
Data visualization encompasses the methods of expressing data in an abstract visual form.

The visual representation of data provides easy comprehension of complex data with multiple attributes and their underlying relationships.


Motivation
The motivation for using data visualization includes:

Comprehension of dense information:

◦ A simple visual chart can easily include thousands of data points.

◦ By using visuals, the user can understand the big picture, as well as longer-term trends that are extremely difficult to interpret purely by expressing data in numbers.

Relationships:

◦ Visualizing data in Cartesian coordinates enables exploration of the relationships between the attributes.


Univariate Visualization
Visual exploration starts with investigating one attribute at a
time using univariate charts.
Histogram

Quartile

Distribution Chart


Histogram
Histograms (or frequency histograms): “Histos” means pole or mast, and “gram” means chart.

Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.

A histogram is a technique for understanding the frequency of occurrence of values: it shows the distribution of the data by plotting the frequency of occurrence in a range.

If X is nominal, a bar is drawn for each known value of X. The height of the bar indicates the frequency (i.e., count) of that X value. The resulting graph is more commonly known as a bar chart.


If X is numeric, the term histogram is preferred.
◦ The range of values for X is partitioned into disjoint consecutive subranges. The subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X.
◦ The range of a bucket is known as the width. Typically, the buckets are of equal width.

◦ Example: a price attribute with a value range of Rs.1 to Rs.200 (rounded up to the nearest rupee) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on.
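
The equal-width binning in the price example can be sketched as follows. Each price is mapped to its bucket (1 to 20, 21 to 40, and so on) and the counts per bucket are the bar heights of the histogram. The function name `bin_label` and the list of prices are hypothetical.

```python
from collections import Counter

def bin_label(price, low=1, width=20):
    """Return the equal-width bucket, e.g. '21-40', that contains the price."""
    index = (price - low) // width      # 0 for 1..20, 1 for 21..40, ...
    start = low + index * width
    return f"{start}-{start + width - 1}"

# Hypothetical prices in rupees, already rounded up to whole rupees.
prices = [5, 18, 20, 21, 37, 40, 41, 59, 150, 200]

histogram = Counter(bin_label(p) for p in prices)

for bucket, count in histogram.items():
    print(f"{bucket:>8}: {'#' * count}")
```

Buckets that contain no prices simply do not appear in the counter; a plotting library would normally draw them as empty bins.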

Quartile
A box whisker plot is a simple visual way of showing the distribution of a continuous variable with information such as quartiles, median, and outliers, optionally overlaid with the mean and standard deviation.

The main attraction of box whisker or quartile charts is that distributions of multiple attributes can be compared side by side and the overlap between them can be deduced.


The quartiles are denoted by the Q1, Q2, and Q3 points, which split the sorted data into four bins holding 25% of the data points each.

In a distribution, 25% of the data points will be below Q1, 50% will be below Q2, and 75% will be below Q3.

The Q1 and Q3 points in a box whisker plot are denoted by the edges of the box.

The Q2 point, the median of the distribution, is indicated by a cross line within the box. The outliers are denoted by circles beyond the ends of the whisker lines.


Example
Suppose you have the math test results for a class of 15 students. Here are the
results:
91 95 54 69 80 85 88 73 71 70 66 90 86 84 73
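
The numbers needed to draw a box whisker plot for these 15 results can be sketched with the standard library. Two choices here are assumptions, not taken from the slides: quartiles use the default "exclusive" method of `statistics.quantiles` (other methods can give slightly different values), and outliers are flagged with the common 1.5 × IQR rule.

```python
import statistics

scores = [91, 95, 54, 69, 80, 85, 88, 73, 71, 70, 66, 90, 86, 84, 73]

# Quartiles of the sorted data: Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print(f"min={min(scores)}, Q1={q1}, median={q2}, Q3={q3}, max={max(scores)}")
print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

For this class, Q1 = 70, the median is 80 and Q3 = 88, so the IQR is 18 and no score falls outside the fences.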


Quartile

Outlier


Distribution Chart


The normal distribution is also called the Gaussian distribution or “bell curve” due to its bell shape.

Example
Distribution charts aim to convey “what is the distribution of my data?”. For example, suppose a survey asked every respondent their age. A distribution chart would be useful to visualize the distribution of ages among the respondents.


Multivariate Visualization
The multivariate visual exploration considers more than one
attribute in the same visual.

These visualizations examine two to four attributes simultaneously


Scatterplot

Scatter Multiple

Scatter Matrix

Bubble Chart


Scatter Plot
One of the key observations that can be drawn from a scatterplot is the existence of a relationship between the two attributes under inquiry.

If the attributes are linearly correlated, the data points align closely to an imaginary straight line; if they are not correlated, the data points are scattered.

Apart from basic correlation, scatterplots can also reveal patterns or clusters in the data and help identify outliers.


Scatter Multiple:

A scatter multiple is an enhanced form of a simple scatterplot where more than two dimensions can be included in the chart and studied simultaneously.

The primary attribute is used for the x-axis coordinate. The secondary axis is shared with more attributes or dimensions.

Scatter Matrix

If the dataset has more than two attributes, it is important to look at combinations of all the attributes through a scatterplot. A scatter matrix solves this need by comparing all combinations of attributes with individual scatterplots and arranging these plots in a matrix.


Bubble Chart

A bubble chart is a variation of a simple scatterplot with the addition of one more attribute, which is used to determine the size of the data point.

In the Iris dataset, for example, petal length and petal width can be used for the x- and y-axes, respectively, and sepal width for the size of the data point. The color of the data point represents the species class label.


Density Chart

Density charts are similar to scatterplots, with one more dimension included as a background color.

The data point can also be colored to visualize one dimension, and hence a total of four dimensions can be visualized in a density chart.

Example:

Petal length is used for the x-axis, sepal length for the y-axis, sepal width for the background color, and class label for the data point color.


High Dimensional Data
High-dimensional data are defined as data in which the number of features (variables observed), p, is close to or larger than the number of observations (or data points), n.

The opposite is low-dimensional data, in which the number of observations, n, far outnumbers the number of features, p.

High-dimensional data implies many dimensions/variables/features/columns.


Visualizing High-Dimensional Data
Visualizing more than three attributes on a two-dimensional medium (like paper or a screen) is challenging.

This limitation can be overcome by using transformation techniques to project the high-dimensional data points into parallel axis space.

In this approach, a Cartesian axis is shared by more than one attribute.


Parallel Chart
A parallel chart visualizes a data point by transforming or projecting multi-dimensional data into a two-dimensional chart medium.

Every attribute or dimension is arranged at its own position along one coordinate (the x-axis), and all the measured values are plotted along the other coordinate (the y-axis).

Since the x-axis is multivariate, each data point is represented as a line in the parallel space.
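
The transformation behind a parallel chart can be sketched as follows: each record becomes a polyline whose x positions are the attribute indexes and whose y positions are the attribute values. Rescaling every attribute to the 0 to 1 range is an assumption added here so that attributes with different units can share one y-axis; the slides do not prescribe it. The function name `to_parallel_lines` is hypothetical and the three rows are illustrative Iris-style measurements.

```python
def to_parallel_lines(rows):
    """Turn each record into a polyline of (axis_position, scaled_value) points."""
    columns = list(zip(*rows))              # one tuple of values per attribute
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    lines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # Min-max scaling (assumed); a constant attribute maps to 0.0.
            scaled = (value - lows[axis]) / span if span else 0.0
            line.append((axis, scaled))
        lines.append(line)
    return lines

# Sepal length, sepal width, petal length, petal width (cm) for three flowers.
rows = [(5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4), (6.3, 3.3, 6.0, 2.5)]

lines = to_parallel_lines(rows)
for line in lines:
    print([(axis, round(value, 2)) for axis, value in line])
```

Passing each polyline to a line-plotting function, one call per record, would draw the chart.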

The visualization is called a parallel axis chart because every attribute is represented on its own axis, each parallel to the y-axis; for the Iris dataset, the four attributes give four parallel axes.


Deviation Chart
A deviation chart is very similar to a parallel chart, as it has parallel axes for all the attributes on the x-axis.

Data points are extended across the dimensions as lines and there is one common y-axis.

Instead of plotting all data lines, deviation charts only show the mean and standard deviation statistics.

For each class, deviation charts show the mean line connecting the mean of each attribute; the standard deviation is shown as the band above and below the mean line.

The mean line does not have to correspond to a data point (line).
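
The statistics a deviation chart draws can be sketched as follows: for every class, the mean of each attribute gives the mean line, and the mean minus and plus one standard deviation gives the band. Using the population standard deviation is an assumption; the function name `deviation_chart_data` and the tiny two-class dataset are hypothetical.

```python
import statistics
from collections import defaultdict

def deviation_chart_data(rows, labels):
    """Per class, return a (mean - std, mean, mean + std) triple for each attribute."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    chart = {}
    for label, members in by_class.items():
        columns = list(zip(*members))                   # values per attribute
        means = [statistics.mean(col) for col in columns]
        stds = [statistics.pstdev(col) for col in columns]
        chart[label] = [(m - s, m, m + s) for m, s in zip(means, stds)]
    return chart

# Hypothetical data: two attributes, two classes.
rows = [(1.0, 2.0), (3.0, 2.0), (10.0, 4.0), (10.0, 8.0)]
labels = ["a", "a", "b", "b"]

chart = deviation_chart_data(rows, labels)
for label, bands in chart.items():
    print(label, bands)
```

The middle value of each triple traces the mean line, which need not coincide with any actual data point.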



Andrews Curve
An Andrews plot belongs to a family of visualization techniques where the high-dimensional data are projected into a vector space so that each data point takes the form of a line or curve.
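
The slide does not give the projection itself. As a sketch, the standard Andrews formula maps a data point (x1, x2, x3, ...) to the curve f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ... for t between -π and π. The function name `andrews_curve` and the sample point are illustrative assumptions.

```python
import math

def andrews_curve(x, t):
    """Evaluate the Andrews curve of data point x at angle t (radians)."""
    total = x[0] / math.sqrt(2)
    for i, value in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2             # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += value * math.sin(harmonic * t)
        else:
            total += value * math.cos(harmonic * t)
    return total

# Illustrative data point: sepal length, sepal width, petal length, petal width.
point = (5.1, 3.5, 1.4, 0.2)

# Sample the curve at evenly spaced angles between -pi and pi.
steps = 8
curve = [andrews_curve(point, -math.pi + 2 * math.pi * k / steps)
         for k in range(steps + 1)]
print([round(v, 3) for v in curve])
```

Plotting one such curve per data point, colored by class, gives the Andrews plot; points that are close in the original space produce curves that stay close together.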


Thank You!!!