Applied - Data - Science MODULE 3 SEM 8

Uploaded by

Dhruv Suvarna
Copyright
© © All Rights Reserved

ADS MODULE 3

Q What are data visualizations? Explain data visualization techniques for
exploratory data analysis

ANS

Simply put, data visualizations allow humans to explore data in many different ways
and see patterns and insights that would not be possible when looking at the raw
form. Humans crave narrative and visualizations allow us to pull a story out of our
stores of data.

The phrase “A picture is worth a thousand words” is especially true when turning
huge piles of data into images a viewer can actually understand and derive meaning
from. Children’s storybooks contain lots of images, but very few words. As kids, we
don’t know many words, but the visuals allow us to easily understand the story.

In our modern digital world, we have huge amounts of data all around us. Data
scientists and ML engineers receive most of the data they work with in a structured
or unstructured format; however, it is difficult for humans to understand and
analyze data in this raw form. Data visualizations (or graphical representations of
data) are vital for understanding the data. They help users explore data through
visual elements like charts, graphs, plots, maps, and other visualizations.

Different types of exploratory data analysis

In every dataset, we have many variables (also called features, input variables, or
independent variables) and target/output variables (also known as labels,
dependent variables, classes, or class labels). The data scientist’s job is to
completely understand each feature individually and the relationship between
different features. The goal is to get the dataset ready for implementing ML
algorithms.

We have three methods for exploratory data analysis:

Univariate analysis

In univariate analysis, each variable is analyzed individually. It gives us
summary statistics for each feature. There are a variety of data visualization
techniques for univariate analysis, including the box plot, histogram, probability
density function (PDF), and cumulative distribution function (CDF).

Bivariate analysis

Bivariate analysis is performed to find the relationship between each feature and
the target variable. Data visualization techniques for bivariate analysis include
the scatter plot and the heatmap.

Multivariate Analysis

As the name signifies, multivariate analysis is performed to understand the
relationship between different features of the dataset. One of the main multivariate
analysis data visualization techniques is the Pair Plot.

Bar Plot

A bar plot is a plot that presents categorical data with rectangular bars. The length or
height of bars is proportional to the frequency of the category. We can count the
values of various categories using bar plots.

Pie Chart

Pie Chart is a circular chart that uses pie slices to show the relative size of data. The
arc length of each pie slice is proportional to the quantity it represents. It works
beautifully on categorical values. There are different variants of pie charts available.

Box-plot

A box plot gives us a five-number summary of any variable: the minimum, the first
quartile, the sample median, the third quartile, and the maximum. A box plot helps
reveal two things:

1. Skewness of distribution
2. Outliers (points that fall outside the whiskers of the box plot)
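The five-number summary can be computed directly. The sketch below is only an illustration: the sample values are hypothetical, and the 1.5 × IQR rule used to flag outliers is a common convention that these notes do not state.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles; "inclusive" interpolates within the sample.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR beyond the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

sample = [2, 4, 5, 7, 8, 9, 11, 12, 40]  # hypothetical data
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
print("five-number summary:", summary)
print("outliers:", outliers)
```

Here the value 40 lies far above the third quartile, so it is the only point flagged as an outlier.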

Histogram and PDF

A histogram is a graphical representation of the distribution of numerical data. It is
an estimate of the probability distribution of a continuous variable. A histogram
represents the number of points that fall in each bin (range of values).
The Probability Density Function (PDF) can be viewed as a smoothed version of the
histogram.

Scatter Plots

A scatter plot is a plot that shows the relationship between two variables of a data
set.

Heat Map

A heatmap is a graphical representation of data in which data values are represented
as colors. In exploratory analysis it is often used to show the correlation between
pairs of variables. Correlation values lie between -1 and 1: 1 denotes a perfect
positive correlation, 0 means no correlation, and -1 denotes a perfect negative
correlation.
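The values shown in a correlation heatmap can be sketched as a matrix of Pearson correlation coefficients. The column names and numbers below are hypothetical, and the plotting step itself (mapping each value to a color) is left out.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns: "score" rises with "hours", "errors" falls with it.
columns = {
    "hours": [1, 2, 3, 4, 5],
    "score": [2, 4, 6, 8, 10],
    "errors": [10, 8, 6, 4, 2],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in matrix])
```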

Line chart

The line chart represents a series of data points connected by straight line segments. It is
generally used to visualize data that changes over time.

Q2 Different ways to represent data

Ans

The following are different ways to represent data:


• Pictograph
• Pie-Chart
• Bar Graph
• Line Graph
• Histogram

Pictograph
A pictograph is a representation of data using different images or symbols.
Pictographs in maths typically find application in concepts like data handling. They
help in laying the foundation for the interpretation of data based on pictorial
information.
For example, the pictograph given below depicts the data of the different types of
pizzas that were ordered on a random day.

Pie-Chart

A pie chart is another type of graph that is used to visually display data in a
circular graph. Pie charts are one of the most commonly used graphs for the
representation of data, using the angles and areas of the sectors of a circle to
represent real-world information. Pie charts are circular charts that record
discrete data, whereby the pie represents the whole and the slices represent the
parts of the whole.
For example, the pie chart given below depicts the data of the different kinds of
desserts preferred by kids at model school.

Bar-Graph

The bar graph is a specific way of data representation using rectangular
bars, where the length of each bar is proportional to the value the bar represents.
Bar graphs are the pictorial representation of grouped data. Bar
graphs are one of the examples of data handling and are an excellent tool for
representing data that are independent of one another and don’t need to be in any
specific order while being represented.
For example, the bar graph given below depicts the data of the number of
employees in the different departments of Cmath Inc.

Line-Graph

A line graph is a special type of graph used to display change over time as a series
of data points connected by straight line segments on two axes. The line graph,
also called a line chart, helps to determine the relationship between two sets of
values, with one data set always being dependent on the other set.
Line graphs show data points and trends clearly, and they can be used to
estimate values that have not yet been recorded.
For example, the line graph given below depicts the data of the pass percentage of
Grade 7 students in a scholarship exam from the year 2010-2016.

Histogram

A histogram is defined as a graph made of a set of rectangles whose bases lie along
the intervals between class boundaries. Each rectangle represents one class, and all
the rectangles are adjacent. The heights of the rectangles are proportional to the
corresponding frequencies for classes of equal width; for classes of unequal width,
the heights are proportional to the frequency densities. For example,
the histogram given below depicts the data of the preference of a particular sport
among people.

Q3 Univariate Plots for Categorical Data. Univariate Plot for Numerical Data

Ans

Univariate Plots

Univariate plots are used to visualize the distribution of a single variable. They are
useful for identifying patterns, outliers, and skewness in the data. Common
univariate plots include histograms, density plots, and box plots.

Univariate Plots for Categorical Data


Univariate plots for categorical data are used to visualize the distribution of a single
categorical variable. They are useful for identifying patterns, frequencies, and
proportions in the data. Common univariate plots for categorical data include bar
plots, count plots, and pie charts.

Bar Plots

A bar plot displays the frequency or proportion of each category in a categorical
variable. Each bar represents a category, with the height of the bar representing the
frequency or proportion of observations in that category.

Count Plots

A count plot is a type of bar plot that displays the count of observations in each
category. It is similar to a bar plot, but instead of displaying a proportion or
another summary value, it always displays the actual count of observations.

Pie Charts

A pie chart displays the proportion of each category in a categorical variable. Each
slice of the pie represents a category, with the angle of the slice representing the
proportion of observations in that category.
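As a rough illustration of how counts turn into pie slices, the sketch below converts category counts into slice angles (proportion × 360°). The survey answers are hypothetical.

```python
from collections import Counter

def slice_angles(labels):
    """Return the pie-slice angle in degrees for each category."""
    counts = Counter(labels)  # the same counts a count plot would display
    total = sum(counts.values())
    return {category: count * 360 / total for category, count in counts.items()}

answers = ["A"] * 5 + ["B"] * 3 + ["C"] * 2  # hypothetical survey answers
angles = slice_angles(answers)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
```

Category A holds half of the observations, so its slice covers half of the circle.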

Univariate Plot For Numerical Data

Univariate plots for numerical data are used to visualize the distribution of a single
numerical variable. They are useful for identifying patterns, outliers, and skewness
in the data. Common univariate plots for numerical data include histograms, density
plots, and box plots.

Histograms

A histogram is a bar chart that displays the frequency of a numerical variable’s
values. It is created by dividing the data into intervals, called bins, and counting the
number of observations that fall within each bin.
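The binning step can be sketched in a few lines. The values, the bin width of 10, and the helper name bin_counts are all assumptions chosen for illustration.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each half-open bin [low, low + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # ignore values outside the binned range
            counts[index] += 1
    return counts

values = [3, 7, 8, 12, 14, 18, 21, 22, 23, 29]  # hypothetical data
counts = bin_counts(values, start=0, width=10, n_bins=3)
for i, count in enumerate(counts):
    low = i * 10
    print(f"[{low:>2}, {low + 10:>2}) " + "#" * count)
```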

Density Plots

A density plot is a smoothed version of a histogram. It uses a kernel density
estimation (KDE) to estimate the probability density function of the data. Density
plots are useful for comparing the distribution of two or more datasets.
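A minimal sketch of kernel density estimation, assuming a Gaussian kernel and a hand-picked bandwidth of 0.8 (neither is specified in these notes). The estimate is the average of one bell curve centred on each data point.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates the Gaussian KDE of data at x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

data = [1.0, 2.0, 2.5, 3.0, 7.0]  # hypothetical data
density = gaussian_kde(data, bandwidth=0.8)

# Riemann-sum check: a valid density should integrate to about 1.
step = 0.01
area = sum(density(-10 + i * step) * step for i in range(3000))
print(round(area, 3))
print(density(2.5) > density(5.0))  # more mass near the cluster around 2 to 3
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual points more closely.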

Box Plots

A box plot, also known as a box-and-whisker plot, displays the distribution of a numerical
variable using the five-number summary: the minimum, first quartile, median, third
quartile, and maximum. It also shows any potential outliers.

Q Bivariate Plots

Bivariate plots are used to visualize the relationship between two variables. They
are useful for identifying trends, patterns, and correlations. Common bivariate plots
include scatter plots, line plots, and bar plots.

Scatter Plots

A scatter plot displays the relationship between two continuous variables. Each
point on the plot represents an observation, with the x-axis representing one
variable and the y-axis representing the other.

Line Plots

A line plot displays the relationship between two continuous variables, with one
variable represented on the x-axis and the other on the y-axis. Line plots are useful
for showing trends over time.

Bar Plots

A bar plot displays the relationship between a categorical variable and a numerical
variable. Each bar represents a category, with the height of the bar representing the
value of the second variable.

Bivariate Plots on Numerical VS Numerical

Bivariate plots for numerical vs. numerical data are used to visualize the
relationship between two numerical variables. They are useful for identifying
trends, patterns, and correlations. Common bivariate plots for numerical vs.
numerical data include scatter plots, line plots, and 2D histograms.

Scatter Plots

A scatter plot displays the relationship between two numerical variables. Each point
on the plot represents an observation, with the x-axis representing one variable and
the y-axis representing the other.

Line Plots

A line plot displays the relationship between two numerical variables, with one
variable represented on the x-axis and the other on the y-axis. Line plots are useful
for showing trends over time.

2D Histograms

A 2D histogram displays the relationship between two numerical variables using a
grid of bins. Each bin represents a range of values for both variables, with the color
of the bin representing the number of observations that fall within that range.

Bivariate plots for Numerical vs. Categorical Data

Bivariate plots for numerical vs. categorical data are used to visualize the
relationship between a numerical variable and a categorical variable. They are
useful for identifying differences, trends, and patterns. Common bivariate plots for
numerical vs. categorical data include box plots, violin plots, and swarm plots.

Box Plots

A box plot displays the distribution of a numerical variable for each category in a
categorical variable. Each box represents the interquartile range (IQR) of the
numerical variable for that category, with the line in the middle representing the
median.

Violin Plots

A violin plot displays the distribution of a numerical variable for each category in a
categorical variable, similar to a box plot. However, instead of showing a box and
whiskers, it shows a kernel density estimate of the distribution.

Swarm Plots

A swarm plot displays the individual observations of a numerical variable for each
category in a categorical variable. Each point represents an observation, with the
position along the x-axis representing the category and the position along the y-axis
representing the value of the numerical variable.

Bivariate Plots on Categorical VS Categorical

Bivariate plots for categorical vs. categorical data are used to visualize the
relationship between two categorical variables. They are useful for identifying
patterns, frequencies, and associations. Common bivariate plots for categorical vs.
categorical data include contingency tables, mosaic plots, and stacked bar plots.

Contingency Tables

A contingency table displays the frequency of observations for each combination of
categories in two categorical variables. It is a tabular representation of the joint
distribution of the two variables.
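A contingency table can be built by counting each pair of categories. The sketch below uses two hypothetical categorical columns; the helper name contingency_table is an assumption.

```python
from collections import Counter

def contingency_table(rows, cols):
    """Count observations for every (row category, column category) pair."""
    pair_counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    # Counter returns 0 for combinations that never occur.
    return {r: {c: pair_counts[(r, c)] for c in col_levels} for r in row_levels}

# Hypothetical observations: one (group, sport) pair per person.
group = ["F", "M", "F", "M", "F", "M"]
sport = ["tennis", "tennis", "golf", "golf", "tennis", "golf"]
table = contingency_table(group, sport)
for level, row in table.items():
    print(level, row)
```

The same counts are what a mosaic plot or a stacked bar plot would draw.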

Mosaic Plots

A mosaic plot displays the frequency of observations for each combination of
categories in two categorical variables, similar to a contingency table. However,
instead of showing a table, it shows a graphical representation of the joint
distribution.

Stacked Bar Plots


A stacked bar plot displays the frequency of observations for each category in one
categorical variable, with the bars stacked to show the frequency of observations for
each category in the other categorical variable.

Q Define Multivariate analysis and discuss Multivariate analysis techniques.

Ans

Multivariate Analysis

Multivariate analysis is a statistical technique that involves analyzing multiple
variables simultaneously. It is used to identify patterns, relationships, and structures
in complex datasets. Multivariate analysis can be used for various purposes, such as
data reduction, classification, and prediction.

There are several types of multivariate analysis techniques, including:

1. Principal Component Analysis (PCA): PCA is a technique used for data
reduction. It transforms the original variables into a new set of variables called
principal components, which are linear combinations of the original variables. The
principal components are ordered so that the first few components capture most of
the variation in the data.
2. Factor Analysis: Factor analysis is a technique used for data reduction. It
identifies underlying factors that explain the correlations among the original
variables. The factors are linear combinations of the original variables.
3. Cluster Analysis: Cluster analysis is a technique used for grouping observations
based on their similarities. It identifies clusters of observations that are similar to
each other and different from observations in other clusters.
4. Discriminant Analysis: Discriminant analysis is a technique used for
classification. It identifies the variables that discriminate between two or more
groups. It can be used for predicting the group membership of new observations.
5. Multiple Linear Regression: Multiple linear regression is a technique used for
prediction. It models the relationship between a dependent variable and multiple
independent variables. It can be used for predicting the value of the dependent
variable based on the values of the independent variables.
6. Logistic Regression: Logistic regression is a technique used for classification. It
models the relationship between a binary dependent variable and multiple
independent variables. It can be used for predicting the probability of the dependent
variable based on the values of the independent variables.
7. Canonical Correlation Analysis: Canonical correlation analysis is a technique
used for identifying the relationships between two sets of variables. It identifies the
linear combinations of the variables in each set that have the highest correlation.

These are some of the commonly used multivariate analysis techniques. The choice
of technique depends on the research question, the type of data, and the goals of the
analysis.
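To make the PCA idea above concrete, here is a small sketch restricted to two variables, where the eigenvalues of the 2 × 2 covariance matrix have a closed form. The data are hypothetical, and a real analysis would normally use a numerical library rather than this hand-rolled version.

```python
import math

def pca_2d_explained_variance(x, y):
    """Share of total variance captured by each principal component (2 variables)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    c = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix are the component variances.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    total = lam1 + lam2
    return lam1 / total, lam2 / total

x = [1, 2, 3, 4, 5]              # hypothetical feature
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical, nearly a multiple of x
first, second = pca_2d_explained_variance(x, y)
print(round(first, 4), round(second, 4))
```

Because the two variables are almost perfectly linearly related, the first component captures nearly all of the variation, which is the situation in which PCA reduces the data well.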

Q6 Define Stem and Leaf plot. List the important parts of a stem and leaf diagram.
Construct back-to-back stem-leaf plot

Ans
A Stem and Leaf Plot is a special table where each data value is split into a "stem"
(the first digit or digits) and a "leaf" (usually the last digit).

• Stem and leaf diagrams are a pictorial way of showing statistics.
• The important parts of a stem and leaf diagram are the stem and the leaf.

How to construct a stem and leaf plot

Example #1: 24, 10, 13, 2, 28, 34, 65, 67, 55, 34, 25, 59, 8, 39, 61
First, put this data in order: 2, 8, 10, 13, 24, 25, 28, 34, 34, 39, 55, 59, 61, 65, 67
We will use 0, 1, 2, 3, 4, 5, and 6 as stems.
The plot is displayed below:
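The construction can be sketched in Python using the Example #1 data. This version assumes non-negative whole numbers, with the tens digit as the stem and the units digit as the leaf; the function name is an assumption.

```python
def stem_and_leaf(values):
    """Group sorted values by stem (tens) and list their leaves (units)."""
    data = sorted(values)
    stems = range(data[0] // 10, data[-1] // 10 + 1)
    plot = {stem: [] for stem in stems}  # keep stems that have no leaves
    for v in data:
        plot[v // 10].append(v % 10)
    return plot

data = [24, 10, 13, 2, 28, 34, 65, 67, 55, 34, 25, 59, 8, 39, 61]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Stem 4 has no leaves because the data contain no values in the forties; it is still shown so that the gap in the distribution stays visible.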

Sometimes, it is useful to show leaves on both sides of the stem.
Say, for instance, you teach algebra in two different classes.
In this case, you may want to compare the performance of the classes to see which
class performed better.

Grade for class A: 60, 68, 70, 75, 84, 86, 90, 91, 92, 94, 94, 96, 100, 100
Grade for class B: 60, 60, 70, 71, 73, 73, 75, 76, 77, 84, 85, 86, 91, 92
The plot is displayed below:
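A back-to-back plot for the two classes can be sketched as follows. The layout (class A leaves on the left, read outward from the stem; class B leaves on the right) is an assumed but common convention.

```python
def leaves_by_stem(values):
    """Map each stem (tens) to its sorted list of leaves (units)."""
    grouped = {}
    for v in sorted(values):
        grouped.setdefault(v // 10, []).append(v % 10)
    return grouped

def back_to_back(left_values, right_values):
    """Return rows of (left leaves reversed, stem, right leaves)."""
    left, right = leaves_by_stem(left_values), leaves_by_stem(right_values)
    first = min(min(left), min(right))
    last = max(max(left), max(right))
    rows = []
    for stem in range(first, last + 1):
        # Left leaves are reversed so they increase away from the stem.
        rows.append((left.get(stem, [])[::-1], stem, right.get(stem, [])))
    return rows

class_a = [60, 68, 70, 75, 84, 86, 90, 91, 92, 94, 94, 96, 100, 100]
class_b = [60, 60, 70, 71, 73, 73, 75, 76, 77, 84, 85, 86, 91, 92]
rows = back_to_back(class_a, class_b)
print("     Class A |    | Class B")
for left_leaves, stem, right_leaves in rows:
    left_text = " ".join(map(str, left_leaves))
    right_text = " ".join(map(str, right_leaves))
    print(f"{left_text:>12} | {stem:>2} | {right_text}")
```

Class A has most of its grades in the 90s and 100s, while class B is concentrated in the 70s, so class A performed better overall.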

Q7 Data Visualization technique- scatter plot matrix

Ans

A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize bivariate
relationships between combinations of variables. Each scatter plot in the matrix
visualizes the relationship between a pair of variables, allowing many relationships
to be explored in one chart.

Variables
A scatter plot matrix is made up of three or more numeric fields. A scatter plot is
created for every pairwise combination of variables selected.

Scatterplot matrices are a great way to roughly determine if you have a linear
correlation between multiple variables. This is particularly helpful in pinpointing
specific variables that might have similar correlations to your genomic or
proteomic data.

▪ The trees dataset contains three columns of measurements: Girth, Height and Volume.

This is an example of a scatterplot matrix. The variables are written in a diagonal
line from top left to bottom right. Then each variable is plotted against each other.
For example, the middle square in the first column is an individual scatterplot of
Girth and Height, with Girth as the X-axis and Height as the Y-axis. This same
plot is replicated in the middle of the top row. In essence, the boxes on the upper
right-hand side of the whole scatterplot are mirror images of the plots on the lower
left-hand side.
In this scatterplot, it is probably safe to say that there is a correlation between Girth
and Volume (Go data! Confirming the obvious) because the plot looks like a line.
There is probably less of a correlation between Height and Girth, and between
Height and Volume. More statistical analyses would be needed to confirm or deny
this.

• Scatterplot matrix is an extension for multidimensional data where a
collection of scatterplots is organized in a matrix simultaneously to provide
correlation information among the attributes.

• We can easily observe patterns in the relationships between pairs of
attributes from the matrix.

Scatterplot Matrix

• Organizes all the pairwise scatterplots in a matrix format

• Each display panel in the matrix is identified by its row and column
coordinates

• The panel at the ith row and jth column is a scatterplot of Xj versus Xi

• The panel at the 3rd row (the top row) and 1st column is a scatterplot of Z versus X

• Panels that are symmetric with respect to the XYZ diagonal have the same
variables as their coordinates, rotated 90°

• The redundancy is designed to improve visual linking

• Patterns can be detected in both horizontal and vertical directions

• Each panel can only visualize the correlation between two variables, without using
retinal visual elements or interaction techniques
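The layout of a scatterplot matrix can be sketched without drawing anything. The snippet below lists the panels for the Girth, Height and Volume columns mentioned above. It assumes the convention from the trees example, where the column variable goes on the x-axis and the row variable on the y-axis, and where rows run from top to bottom.

```python
from itertools import combinations

def panel_grid(variables):
    """Describe each panel of a scatterplot matrix as text."""
    grid = []
    for row_var in variables:
        row = []
        for col_var in variables:
            if row_var == col_var:
                row.append(row_var)  # diagonal panels only show the variable name
            else:
                row.append(f"{col_var} (x) vs {row_var} (y)")
        grid.append(row)
    return grid

variables = ["Girth", "Height", "Volume"]
pairs = list(combinations(variables, 2))  # distinct pairwise combinations
grid = panel_grid(variables)
print(pairs)
for row in grid:
    print(" | ".join(row))
```

Three variables give three distinct pairs but six off-diagonal panels, because each pair appears twice with its axes swapped.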
Q8 Data Visualization technique- Bubble Chart

Ans

• An extension of a scatterplot, a bubble chart is commonly used to visualize
relationships between three or more numeric variables. Each bubble in a chart
represents a single data point.

• The values for each bubble are encoded by

• 1) its horizontal position on the x-axis,

• 2) its vertical position on the y-axis,

• and 3) the size of the bubble.

• Sometimes, the color of the bubble or its movement in animation can
represent more dimensions.

How do you interpret a bubble chart?

• Imagine you work for a global organization and are gathering some data for a
competitive analysis.

• You’ve pulled together the following table, which shows each
competitor’s market share, sales volume, year-over-year sales growth,
and its primary region (North America, Europe/Middle East/Africa, and
Asia-Pacific).

Scatterplot

• The first two columns of data (market share and sales volume for each
competitor) are displayed in the graph below.
• This is called a scatterplot, which visualizes the relationship between two
series: the x-axis (market share) and the y-axis (sales volume).

• A bubble chart expands on the scatterplot by adding additional
dimensions.

• We can add YoY sales growth as the third dimension, encoded by the size
of the bubbles:
• This third dimension gives a visual sense of how much the
competitors differ from each other with respect to their sales change:
the higher the growth, the larger the bubble.

• We could even take it one step further and encode the fourth variable
(Region) by color:
Q Run Chart

Ans

• A run chart is a line graph of data plotted over time.

• By collecting and charting data over time, you can find trends or patterns
in the process.

• It is a line graph showing a measure in chronological order, with the
measure on the vertical (y) axis and time or observation number on the
horizontal (x) axis.

• The median of the data points (the middle value) is added once 10 or so data
points are available.
• Changes made to a process, and other useful annotations, are also often
marked on the graph so that they can be connected with the impact on the
process.


• A graph/chart that displays data over time

Important Elements of Run Chart

1. TITLE

• Measure value per Unit Time

2. AXES

• X-axis with time intervals in which metrics are captured (e.g., days,
months, years)

• Y-axis with value for measure of interest

• Labelled axes noting the correct units of measurement

• An outcome measure must be plotted; process and balancing measures are
optional, but ideal

3. DATA

At least 10 data points, but the ideal is 12: six before the intervention, six after

Annotations so that chart behavior can be considered in the context of important
project events

4. MEDIAN LINE (Centreline)

Median = point at which half the numbers are above and half are below the
centreline
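The centreline can be computed as the median of the plotted values. The twelve measurements below are hypothetical (six before and six after an imagined intervention), chosen only to illustrate the calculation.

```python
import statistics

# Hypothetical weekly measurements: six before and six after an intervention.
measurements = [14, 12, 15, 13, 16, 15, 11, 10, 9, 10, 8, 9]

centreline = statistics.median(measurements)
above = sum(1 for v in measurements if v > centreline)
below = sum(1 for v in measurements if v < centreline)

print("median (centreline):", centreline)
for period, value in enumerate(measurements, start=1):
    side = "above" if value > centreline else "below"
    print(f"period {period:>2}: {value:>2} ({side})")
```

Half of the points fall above the centreline and half below. In this made-up series all the points above it come before the intervention, which is the kind of pattern a run chart is meant to reveal.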

Q Data Visualization technique- Dot plot

Ans

• Dot plot or dot graph is just one of the many types of graphs and charts to
organize statistical data.

• It uses dots to represent data.

• A Dot Plot is used for relatively small sets of data whose values fall into a
number of discrete categories.

• If a value appears more than once, the dots are stacked one above the
other.
• That way the column height of dots shows the frequency for that value.

• It is used to plot frequency counts when you have a small number of categories.

• Dot plots are very useful when the variable is quantitative or categorical.

• Dot graphs are also used for univariate data (data with only one variable
that you can measure).
• Example:

• Suppose you have a class of 26 students.

• They are asked to tell their favorite color.

• The dot plot below represents their choices:

It is obvious that blue is the most preferred color by the students in this class.

• The pattern of data in a dot plot can be described in terms of
symmetry and skewness only if the categories are quantitative.

• Dot plots are used most often to plot frequency counts within a small
number of categories, usually with small sets of data.
• Dot plots are great ways to allow us to identify the spread of the data and the
mode of the data.

Dot Plots – How to draw…

1. Determine the highest and lowest values.

2. Draw a number line that starts at the lowest and finishes at the highest.

3. Now place a dot above the number for the first data entry and then a dot
above the next number for the second data entry and so on.

4. If you get to a value that already has a dot then put another dot above this
one.

5. The dots need to be evenly spaced to give an accurate picture.
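The drawing steps above can be sketched as a text-based dot plot. The data values are hypothetical, and the letter "o" stands in for a dot.

```python
from collections import Counter

def dot_plot_rows(values):
    """Return the number line and the rows of stacked dots, top row first."""
    counts = Counter(values)
    number_line = list(range(min(values), max(values) + 1))  # steps 1 and 2
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # A dot appears at this height only if the value occurs often enough.
        cells = ["o" if counts[n] >= level else " " for n in number_line]
        rows.append(" ".join(cells))  # even spacing between columns (step 5)
    return number_line, rows

values = [1, 2, 2, 3, 3, 3, 4, 4, 6]  # hypothetical data
number_line, rows = dot_plot_rows(values)
for row in rows:
    print(row)
print(" ".join(str(n) for n in number_line))
```

The tallest column marks the mode (3 in this data), and the empty column above 5 shows a gap in the values.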

Q What is Ogive graph and explain types of Ogive graph

Ans

Question: Explain cross-validation, K-fold cross-validation, leave-one-out and
bootstrapping

Ans

Cross-validation, K-fold cross-validation, leave-one-out cross-validation, and
bootstrapping are all techniques used in machine learning to assess model
performance and generalization ability. Cross-validation is the general concept,
while the others are specific implementations. K-fold splits the data into k subsets,
training on k-1 and testing on the remaining one, repeating this k times.
Leave-one-out uses k=n (where n is the data size), training on n-1 and testing on
the remaining one. Bootstrapping creates multiple training sets by randomly
sampling with replacement from the original data.
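The splitting logic of K-fold and leave-one-out can be sketched with plain index lists. This is a minimal illustration: the function name is an assumption, no model is trained, and the data are not shuffled first (which is commonly done in practice).

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 5-fold cross-validation on 10 observations.
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)

# Leave-one-out is the special case k = n.
loocv = list(k_fold_indices(4, 4))
print([test for _, test in loocv])
```

Every observation appears in exactly one test fold, so each one is used for evaluation once and for training k-1 times.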

Here's a more detailed explanation:

1. Cross-validation:

• A general resampling technique used to evaluate the performance of a model
on unseen data.

• It helps to estimate how well the model will generalize to new data and
to detect overfitting.

• Involves splitting the data into multiple subsets and using different
combinations for training and testing.

• The results from multiple iterations are then averaged to get a more robust
performance estimate.

2. K-fold Cross-validation:

• A specific type of cross-validation where the data is divided into k equal
subsets (folds).

• Each fold is used once as a validation set, while the remaining k-1 folds are
used for training.

• This process is repeated k times, with each fold serving as the validation set
once.

• The average performance across all k iterations provides a more reliable
estimate of the model's performance.

3. Leave-one-out Cross-validation (LOOCV):

• A special case of K-fold cross-validation where k is equal to the number of
data points (n).

• Each data point is used as a validation set once, while the remaining n-1
points are used for training.

• This results in a very high number of iterations (n), ensuring that every data
point contributes to the performance evaluation.

• LOOCV can be computationally expensive, but it provides a nearly unbiased
estimate of the model's performance, according to Springer.

4. Bootstrapping:

• A resampling technique that creates multiple datasets by sampling with
replacement from the original dataset.

• In each bootstrap sample, data points are randomly selected, potentially
including some data points multiple times and excluding others entirely.

• This creates a collection of different training sets, which can be used to train
and evaluate models.
• Bootstrapping can be used for estimating the variability of model parameters
and creating confidence intervals.
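Bootstrapping can be sketched by resampling a small dataset with replacement and looking at how the sample mean varies. The data, the number of resamples, and the simple percentile method for the interval are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples, seed=0):
    """Mean of each bootstrap sample drawn with replacement from data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(data, k=n)  # same size as the original, with repeats
        means.append(statistics.mean(sample))
    return means

def percentile_interval(estimates, level=0.95):
    """Simple percentile interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    lower = ordered[int((1 - level) / 2 * len(ordered))]
    upper = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lower, upper

data = [12, 15, 9, 20, 14, 18, 11, 16, 13, 17]  # hypothetical measurements
means = bootstrap_means(data, n_resamples=1000)
lower, upper = percentile_interval(means)
print("sample mean:", statistics.mean(data))
print("approximate 95% interval for the mean:", (lower, upper))
```

The spread of the bootstrap means gives a sense of how much the estimate would vary if the data were collected again.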

The following table shows the time taken (in minutes) by 100 students to
travel to school on a particular day

Draw the histogram. Also find:

(i) The number of students who travel to school within 15 minutes.

(ii) Number of students whose travelling time is more than 20 minutes.

Solution:

Since we are displaying the distribution of time taken (in minutes) by 100 students
to travel to school on a particular day in visual form, the histogram is drawn.

Step 1 : Time taken are marked along the X-axis and labeled as “Time (in
minutes)”.

Step 2 : Number of students are marked along the Y-axis and labeled as “No. of
students”.

Step 3 : Corresponding to each time interval, an adjacent vertical bar is drawn whose
height is proportional to the number of students.

The Histogram is presented in fig

(i) 5+25+40=70 students travel to school within 15 minutes.

(ii) 13 students have a travelling time of more than 20 minutes.
