Applied - Data - Science MODULE 3 SEM 8
ANS
Simply put, data visualizations allow humans to explore data in many different ways
and see patterns and insights that would not be possible when looking at the raw
form. Humans crave narrative and visualizations allow us to pull a story out of our
stores of data.
The phrase “A picture is worth a thousand words” is especially true when turning
huge piles of data into images a viewer can actually understand and derive meaning
from. Children’s storybooks contain lots of images, but very few words. As kids, we
don’t know many words, but the visuals allow us to easily understand the story.
In our modern digital world, we have huge amounts of data all around us. Data
scientists and ML engineers receive most of their data in a structured or
unstructured format, which is difficult for humans to read and analyze directly.
Data visualizations (graphical representations of data) are vital for
understanding the data. They help users explore data through visual elements like
charts, graphs, plots, and maps.
Different types of exploratory data analysis
Univariate analysis
In univariate analysis, each variable is analyzed individually, which gives us
summary statistics for each feature. Data visualization techniques for univariate
analysis include the box plot, histogram, PDF (probability density function), and
CDF (cumulative distribution function).
Bivariate analysis
Bivariate analysis is performed to find the relationship between each feature and
the target variable. Data visualization techniques for bivariate analysis include
the scatter plot and the heatmap.
Multivariate Analysis
Bar Plot
A bar plot presents categorical data with rectangular bars. The length or height of
each bar is proportional to the frequency of its category, so bar plots let us
compare the counts of the various categories.
Pie Chart
A pie chart is a circular chart that uses slices to show the relative size of data.
The arc length of each slice is proportional to the quantity it represents. It works
well for categorical values. There are different variants of pie charts available.
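The proportionality rule can be checked with a little arithmetic. The sketch below converts category counts into slice angles. The dessert counts are made-up values for illustration only, and slice_angles is a hypothetical helper name, not part of any library.

```python
def slice_angles(counts):
    """Map each category to its pie-slice angle in degrees."""
    total = sum(counts.values())
    return {label: 360 * value / total for label, value in counts.items()}


# Hypothetical counts, for illustration only.
desserts = {"Ice cream": 20, "Cake": 10, "Pudding": 10}
angles = slice_angles(desserts)

for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```

A category holding half of the observations gets half of the circle, and the angles always add up to 360 degrees.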
Box-plot
A box plot reveals:
1. Skewness of the distribution
2. Outliers (points that fall outside the whiskers of the box plot)
Scatter Plot
A scatter plot is a plot that shows the relationship between two variables of a data
set.
Heat Map
Line chart
The line chart represents a series of data points connected by a straight line. It is
generally used to visualize data that changes over time.
Ans
Pictograph
A pictograph is a representation of data using different images or symbols.
Pictographs in maths typically find application in concepts like data handling. They
help in laying the foundation for the interpretation of data based on pictorial
information.
For example, the pictograph given below depicts the data of the different types of
pizzas that were ordered on a random day.
Pie-Chart
A pie chart is another type of graph that displays data in a circular form. Pie
charts are among the most commonly used graphs, and they use the properties of
circles and angles to represent real-world information. They record discrete data,
where the whole pie represents the total and the slices represent the parts of
the whole.
For example, the pie chart given below depicts the data of the different kinds of
desserts preferred by kids at model school.
Bar-Graph
Line-Graph
A line graph is a type of graph used to display change over time as a series
of data points connected by straight line segments on two axes. The line graph,
also called a line chart, helps to determine the relationship between two sets of
values, with one data set always being dependent on the other.
Line graphs show data points and trends clearly, and they can be used to make
predictions about data not yet recorded.
For example, the line graph given below depicts the data of the pass percentage of
Grade 7 students in a scholarship exam from the year 2010-2016.
Histogram
A histogram is a graph made of a set of rectangles whose bases lie along the
intervals between class boundaries. Each rectangle represents one class, and all
the rectangles are adjacent. For classes of equal width, the heights of the
rectangles are proportional to the corresponding frequencies; for classes of
unequal width, the heights are proportional to the frequency densities. For example,
the bar graph given below depicts the data of the preference of a particular sport
among people.
Q3 Univariate Plots for Categorical Data. Univariate Plot for Numerical Data
Ans
Univariate Plots
Univariate plots are used to visualize the distribution of a single variable. They are
useful for identifying patterns, outliers, and skewness in the data. Common
univariate plots include histograms, density plots, and box plots.
Bar Plots
Count Plots
A count plot is a type of bar plot that displays the count of observations in each
category. It is similar to a bar plot, but instead of displaying the frequency or
proportion, it displays the actual count of observations.
Pie Charts
A pie chart displays the proportion of each category in a categorical variable. Each
slice of the pie represents a category, with the angle of the slice representing the
proportion of observations in that category.
Univariate plots for numerical data are used to visualize the distribution of a single
numerical variable. They are useful for identifying patterns, outliers, and skewness
in the data. Common univariate plots for numerical data include histograms, density
plots, and box plots.
Histograms
Density Plots
Box Plots
A box plot, also known as a box-and-whisker plot, displays the distribution of a
numerical variable using the five-number summary: the minimum, first quartile,
median, third quartile, and maximum. It also shows any potential outliers.
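The five-number summary behind a box plot can be computed directly. The sketch below is one minimal way to do it with Python's statistics module. The sample is hypothetical, the helper names are made up for this example, and the 1.5 x IQR rule for flagging outliers is a common convention assumed here, not something stated in these notes.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def find_outliers(values):
    """Flag points more than 1.5 * IQR beyond the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


# Hypothetical sample, for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print("Five-number summary:", five_number_summary(sample))
print("Potential outliers:", find_outliers(sample))
```

Different tools compute quartiles slightly differently, so the exact Q1 and Q3 values may not match another package.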
Q Bivariate Plots
Bivariate plots are used to visualize the relationship between two variables. They
are useful for identifying trends, patterns, and correlations. Common bivariate plots
include scatter plots, line plots, and bar plots.
Scatter Plots
A scatter plot displays the relationship between two continuous variables. Each
point on the plot represents an observation, with the x-axis representing one
variable and the y-axis representing the other.
Line Plots
A line plot displays the relationship between two continuous variables, with one
variable represented on the x-axis and the other on the y-axis. Line plots are useful
for showing trends over time.
Bar Plots
A bar plot displays the relationship between a categorical variable and a numerical
variable. Each bar represents a category, with the height of the bar representing
the value of the second variable.
Bivariate plots for numerical vs. numerical data are used to visualize the
relationship between two numerical variables. They are useful for identifying
trends, patterns, and correlations. Common bivariate plots for numerical vs.
numerical data include scatter plots, line plots, and 2D histograms.
Scatter Plots
A scatter plot displays the relationship between two numerical variables. Each point
on the plot represents an observation, with the x-axis representing one variable and
the y-axis representing the other.
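A scatter plot suggests a correlation visually, and the Pearson correlation coefficient puts a number on it. The sketch below computes it from scratch. The study-hours data is made up for illustration, and pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical observations: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)
print(f"Pearson r = {r:.3f}")
```

A value close to +1 or -1 corresponds to points lying near a straight line on the scatter plot, while a value near 0 corresponds to no linear pattern.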
Line Plots
A line plot displays the relationship between two numerical variables, with one
variable represented on the x-axis and the other on the y-axis. Line plots are useful
for showing trends over time.
2D Histograms
Bivariate plots for numerical vs. categorical data are used to visualize the
relationship between a numerical variable and a categorical variable. They are
useful for identifying differences, trends, and patterns. Common bivariate plots for
numerical vs. categorical data include box plots, violin plots, and swarm plots.
Box Plots
A box plot displays the distribution of a numerical variable for each category in a
categorical variable. Each box represents the interquartile range (IQR) of the
numerical variable for that category, with the line in the middle representing the
median.
Violin Plots
A violin plot displays the distribution of a numerical variable for each category in a
categorical variable, similar to a box plot. However, instead of showing a box and
whiskers, it shows a kernel density estimate of the distribution.
Swarm Plots
A swarm plot displays the individual observations of a numerical variable for each
category in a categorical variable. Each point represents an observation, with the
position along the x-axis representing the category and the position along the y-axis
representing the value of the numerical variable.
Bivariate plots for categorical vs. categorical data are used to visualize the
relationship between two categorical variables. They are useful for identifying
patterns, frequencies, and associations. Common bivariate plots for categorical vs.
categorical data include contingency tables, mosaic plots, and stacked bar plots.
Contingency Tables
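A contingency table counts how often each combination of two categories occurs. As a minimal sketch, the code below builds one from a list of paired observations. The grade and sport values are hypothetical, and contingency_table is a helper name made up for this example.

```python
from collections import Counter


def contingency_table(pairs):
    """Return row labels, column labels, and a grid of co-occurrence counts."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    grid = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, grid


# Hypothetical observations: (grade, favourite sport).
observations = [
    ("Grade 7", "Cricket"), ("Grade 7", "Football"), ("Grade 7", "Cricket"),
    ("Grade 8", "Cricket"), ("Grade 8", "Football"), ("Grade 8", "Football"),
]
rows, cols, grid = contingency_table(observations)

print("".ljust(10) + "".join(col.ljust(10) for col in cols))
for row, row_counts in zip(rows, grid):
    print(row.ljust(10) + "".join(str(n).ljust(10) for n in row_counts))
```

The same grid of counts is what a stacked bar plot or a mosaic plot would draw.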
Mosaic Plots
Ans
MultiVariate Analysis
These are some of the commonly used multivariate analysis techniques. The choice
of technique depends on the research question, the type of data, and the goals of the
analysis.
Q6 Define Stem and Leaf plot. List the important parts of a stem and leaf diagram.
Ans
A Stem and Leaf Plot is a special table where each data value is split into a "stem"
(the first digit or digits) and a "leaf" (usually the last digit).
Like in this example:
• Stem and leaf diagrams are a pictorial way of showing statistics
The important parts of a stem and leaf diagram are
Grades for class A: 60, 68, 70, 75, 84, 86, 90, 91, 92, 94, 94, 96, 100, 100
Grades for class B: 60, 60, 70, 71, 73, 73, 75, 76, 77, 84, 85, 86, 91, 92
The plot is displayed below:
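As a rough sketch, a stem and leaf plot for the two classes can be generated from the grades listed above by splitting each value into its tens (stem) and units (leaf). The layout printed here is only one possible arrangement and may differ from the original figure. stem_and_leaf is a hypothetical helper name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)


class_a = [60, 68, 70, 75, 84, 86, 90, 91, 92, 94, 94, 96, 100, 100]
class_b = [60, 60, 70, 71, 73, 73, 75, 76, 77, 84, 85, 86, 91, 92]

for name, grades in (("Class A", class_a), ("Class B", class_b)):
    print(name)
    for stem, leaves in stem_and_leaf(grades).items():
        print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Here a grade of 100 has stem 10 and leaf 0, so each row still lists one digit per observation.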
Ans
A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize bivariate
relationships between combinations of variables. Each scatter plot in the matrix
visualizes the relationship between a pair of variables, allowing many relationships
to be explored in one chart.
Variables
A scatter plot matrix is made up of three or more numeric fields. A scatter plot is
created for every pairwise combination of variables selected.
Scatterplot matrices are a great way to roughly determine if you have a linear
correlation between multiple variables. This is particularly helpful in pinpointing
specific variables that might have similar correlations to your genomic or
proteomic data.
Scatterplot Matrix
• Panels that are symmetric with respect to the diagonal have the same
variables as their coordinates, rotated 90°
• The redundancy is designed to improve visual linking
• Each panel can only visualize the correlation between two variables, unless
retinal visual elements (such as color or size) or interaction techniques are used
Q8 Data Visualization technique- Bubble Chart
Ans
• Imagine you work for a global organization and are gathering some data for a
competitive analysis.
• The first two columns of data (market share and sales volume for each
competitor) are displayed in the graph below.
• This is called a scatterplot, which visualizes the relationship between two
series: the x-axis (market share) and the y-axis (sales volume).
• We can add YoY sales growth as the third dimension, encoded by the size
of the bubbles:
• This third dimension gives a visual sense of how much the
competitors differ from each other with respect to their sales change:
the higher the growth, the larger the bubble.
• We could even take it one step further and encode the fourth variable
(Region) by color:
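The encoding described above can be sketched in code: two variables become positions, the third becomes bubble area, and the fourth becomes color. The competitor figures below are made up for illustration, and bubble_encoding with its min_area and max_area parameters is a hypothetical helper, not a library function.

```python
def bubble_encoding(rows, min_area=50, max_area=500):
    """Turn records into x, y, bubble areas, and colours for a bubble chart."""
    growths = [row["growth"] for row in rows]
    low, high = min(growths), max(growths)
    span = (high - low) or 1  # avoid dividing by zero when all growths match
    palette = ["tab:blue", "tab:orange", "tab:green", "tab:red"]
    regions = sorted({row["region"] for row in rows})
    colour_of = {region: palette[i % len(palette)] for i, region in enumerate(regions)}
    return {
        "x": [row["share"] for row in rows],    # market share
        "y": [row["sales"] for row in rows],    # sales volume
        "sizes": [min_area + (g - low) / span * (max_area - min_area) for g in growths],
        "colours": [colour_of[row["region"]] for row in rows],
    }


# Hypothetical competitors, for illustration only.
competitors = [
    {"name": "A", "share": 10, "sales": 120, "growth": 2, "region": "EMEA"},
    {"name": "B", "share": 25, "sales": 300, "growth": 12, "region": "APAC"},
    {"name": "C", "share": 15, "sales": 180, "growth": 7, "region": "EMEA"},
]
encoded = bubble_encoding(competitors)
print(encoded)
```

The resulting lists can be handed to a plotting call, for example Matplotlib's scatter(x, y, s=sizes, c=colours), to draw the actual chart.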
Q Run Chart
Ans
• By collecting and charting data over time, you can find trends or patterns
in the process.
• The median of the data points (the middle value) is added once 10 or so data
points are available.
• Changes made to a process, and other useful annotations, are also often
marked on the graph so that they can be connected with the impact on the
process.
• A graph/chart that displays data over time
1. TITLE
2. AXES
• X-axis with time intervals in which metrics are captured (e.g. days,
months, years)
3. DATA
At least 10 data points, but the ideal is 12: six before the intervention, six after
Median = the value at which half the data points are above and half are below; it
is drawn as the centreline
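The median centreline of a run chart is simple to compute. The sketch below uses twelve made-up measurements (six before and six after a hypothetical change) and reports which side of the median each point falls on. run_chart_summary is a helper name invented for this example.

```python
import statistics


def run_chart_summary(values):
    """Return the median centreline and each point's side of it."""
    centre = statistics.median(values)
    sides = [
        "above" if v > centre else "below" if v < centre else "on"
        for v in values
    ]
    return centre, sides


# Hypothetical waiting times (minutes): six before a change, six after.
waits = [30, 32, 29, 31, 33, 30, 26, 25, 27, 24, 26, 25]
centre, sides = run_chart_summary(waits)

print("Median centreline:", centre)
for period, (value, side) in enumerate(zip(waits, sides), start=1):
    print(f"Period {period:>2}: {value} ({side} the median)")
```

A long run of consecutive points on one side of the centreline, as in this made-up series, is the kind of pattern a run chart is meant to expose.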
Ans
• Dot plot or dot graph is just one of the many types of graphs and charts to
organize statistical data.
• A dot plot is used for relatively small sets of data where the values fall into a
number of discrete categories.
• If a value appears more than one time, the dots are ordered one above the
other.
• That way the column height of dots shows the frequency for that value.
• Dot graphs are also used for univariate data (data with only one variable
that you can measure).
• Example:
It is obvious that blue is the most preferred color by the students in this class.
• Dot plots are used most often to plot frequency counts within a small
number of categories, usually with small sets of data.
• Dot plots are great ways to allow us to identify the spread of the data and the
mode of the data.
2. Draw a number line that starts at the lowest and finishes at the highest.
3. Now place a dot above the number for the first data entry and then a dot
above the next number for the second data entry and so on.
4. If you get to a value that already has a dot then put another dot above this
one.
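The drawing steps above can be sketched as a small text-based dot plot. The data is hypothetical, dot_plot is a made-up helper name, and the column alignment assumes single-digit values.

```python
from collections import Counter


def dot_plot(values):
    """Render a text dot plot: one column of marks per value on the number line."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)  # number line, lowest to highest
    height = max(counts.values())
    lines = []
    for level in range(height, 0, -1):          # stack repeated values upwards
        row = " ".join("*" if counts[v] >= level else " " for v in axis)
        lines.append(row.rstrip())
    lines.append(" ".join(str(v) for v in axis))
    return "\n".join(lines)


# Hypothetical data: number of siblings reported by seven students.
siblings = [1, 2, 2, 3, 3, 3, 5]
print(dot_plot(siblings))
```

The tallest column marks the mode, and the distance between the first and last occupied columns shows the spread.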
Q What is an Ogive graph? Explain the types of Ogive graph.
Ans
Q
1. Cross-validation:
• It helps to estimate how well the model will generalize to new data and
prevent overfitting.
• Involves splitting the data into multiple subsets and using different
combinations for training and testing.
• The results from multiple iterations are then averaged to get a more robust
performance estimate.
2. K-fold Cross-validation:
• This process is repeated k times, with each fold serving as the validation set
once.
3. Leave-one-out Cross-validation:
• Each data point is used as a validation set once, while the remaining n-1
points are used for training.
• This results in a very high number of iterations (n), ensuring that every data
point contributes to the performance evaluation.
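The splitting schemes described above can be sketched without any library. In the code below, k_fold_splits produces the train and validation indices for each fold, and setting k equal to the number of data points gives the leave-one-out case. The "model" is only a stand-in that predicts the mean of the training targets, the data is made up, and both function names are hypothetical. In practice the data is usually shuffled before splitting, which this sketch skips.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(targets, k):
    """Average the validation error of a mean-predicting baseline over k folds."""
    errors = []
    for train, validation in k_fold_splits(len(targets), k):
        prediction = sum(targets[i] for i in train) / len(train)  # "fit" step
        fold_error = sum(abs(targets[i] - prediction) for i in validation) / len(validation)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Hypothetical target values, for illustration only.
targets = [3, 5, 4, 6, 5, 7]
print("3-fold error:", cross_validate(targets, k=3))
print("Leave-one-out error:", cross_validate(targets, k=len(targets)))
```

Every index appears in exactly one validation fold, so each data point is used for validation once and for training in all the other iterations.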
4. Bootstrapping:
• This creates a collection of different training sets, which can be used to train
and evaluate models.
• Bootstrapping can be used for estimating the variability of model parameters
and creating confidence intervals.
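Bootstrapping draws samples with replacement from the original data, each the same size as the original. The sketch below uses that idea to estimate a confidence interval for a mean. The data is made up, the function names are hypothetical, and the simple percentile method used for the interval is one common choice assumed here.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each bootstrap resample (drawn with replacement, same size as data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [statistics.mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)]


def percentile_interval(estimates, confidence=0.95):
    """Percentile confidence interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1 - confidence) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper


# Hypothetical measurements, for illustration only.
data = [12, 15, 9, 20, 14, 11, 18, 16]
means = bootstrap_means(data)
low, high = percentile_interval(means)
print(f"Sample mean: {statistics.mean(data):.2f}")
print(f"Approximate 95% interval for the mean: {low:.2f} to {high:.2f}")
```

The spread of the resampled means is what gives the estimate of variability. The same resamples could serve as alternative training sets for a model.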
The following table shows the time taken (in minutes) by 100 students to
travel to school on a particular day
Solution:
Since we are displaying the distribution of time taken (in minutes) by 100 students
to travel to school on a particular day in visual form, the histogram is drawn.
Step 1: The time taken is marked along the X-axis and labeled “Time (in
minutes)”.
Step 2: The number of students is marked along the Y-axis and labeled “No. of
students”.
Step 3: For each time interval, a vertical bar is drawn, adjacent to its
neighbours, whose height is proportional to the number of students.
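The counting behind these steps can be sketched as follows. The travel times below are made-up values and are not the table from the question. class_frequencies is a hypothetical helper, and equal-width classes of 10 minutes are an assumption of this example.

```python
def class_frequencies(values, start, width, n_classes):
    """Count how many values fall in each class interval [low, high)."""
    counts = [0] * n_classes
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]


# Hypothetical travel times in minutes (NOT the data from the question).
times = [4, 7, 12, 15, 18, 21, 22, 25, 27, 33, 36, 41]
frequencies = class_frequencies(times, start=0, width=10, n_classes=5)

for low, high, count in frequencies:
    print(f"{low:>2}-{high:<2} | {'#' * count} ({count})")
```

Each printed row plays the role of one bar: the interval is the base on the X-axis and the count is the height on the Y-axis.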