Data Visualization

The document provides a comprehensive overview of data visualization, emphasizing its importance in data science for tasks such as data cleaning, exploration, and presenting results. It discusses various types of visualizations, the workflow for creating effective visualizations, and tools available for data visualization. Additionally, it highlights best practices, essential skills needed, and answers frequently asked questions related to data visualization.

Uploaded by

Abhishek Anand

Data Visualization

1. What is Data Visualization?
2. Importance of Data Visualization in Data Science
3. Different Types of Data Visualization
4. Data Visualization Process/Workflow
5. Tools and Software for Data Visualization
6. Data Visualization Techniques in Data Science
7. Advantages and Disadvantages of Data Visualization
8. Examples of Data Visualization in Data Science
9. Data Visualization Best Practices
10. Essential Skills for Data Visualization
11. Conclusion
12. Frequently Asked Questions (FAQs)

A picture is worth a thousand words. People would rather look at
pictures than read text. That’s why visualization matters at every step
of the data science project lifecycle. From data understanding to model
validation, data visualization plays an important role.

Modern tools make data visualization much easier and more effective.
We need to follow a standard workflow to create good visualizations that
everyone can understand. All of this is discussed here. You will also
see different data visualization graphs with their relevant use cases.

What is Data Visualization?

Importance of Data Visualization in Data Science

Earlier, I mentioned the importance of data visualization in data science.
Here are some more details.

1. Data cleaning

Data visualization plays an important role in data cleaning. Good
examples are detecting outliers and checking for multicollinearity. We
can create scatterplots to detect outliers and generate heatmaps to
check for multicollinearity.
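
The points that stand out in a scatterplot can also be flagged numerically. The sketch below is one possible illustration using only the Python standard library. It assumes the common 1.5 × IQR rule, which the text does not prescribe, and the function name iqr_outliers and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one suspicious value.
sample = [10, 12, 11, 13, 12, 11, 95, 12, 10]
print(iqr_outliers(sample))
```

Values flagged this way are candidates for inspection, not automatic removal; whether they are errors or genuine extremes depends on the dataset.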

2. Data Exploration

Before building any model, we need to do some exploratory data
analysis to identify dataset characteristics. For example, we can create
histograms for continuous variables to check for normality in the data.
We can create scatterplots between two features to check whether they
are correlated. Likewise, we can create a bar chart for the label column
with two or more classes to identify class imbalance.
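
As a rough illustration of the class imbalance check, the sketch below counts the labels and prints a text-only bar chart, so it runs without any plotting library. The helper names class_counts and text_bar_chart and the label data are hypothetical; in practice the same counts could be passed to a bar chart in matplotlib or seaborn.

```python
from collections import Counter


def class_counts(labels):
    """Count how many samples fall into each class, most frequent first."""
    return Counter(labels).most_common()


def text_bar_chart(counts, width=40):
    """Render (label, count) pairs as a horizontal text bar chart."""
    largest = max(count for _, count in counts)
    lines = []
    for label, count in counts:
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{label:>10} | {bar} {count}")
    return "\n".join(lines)


# Hypothetical label column for a binary classification problem.
labels = ["no"] * 90 + ["yes"] * 10
counts = class_counts(labels)
print(text_bar_chart(counts))
```

A large gap between the bars, as in this made-up 90/10 split, signals class imbalance.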

3. Evaluation of modeling outputs

We can plot a learning curve to monitor the performance of a model
during training and a confusion matrix to evaluate its predictions.
Plots are also useful in validating model assumptions. For example, we
can create a residuals plot and a histogram of the residuals to validate
the assumptions of a linear regression model.
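
To make the confusion matrix concrete, here is a minimal sketch that builds one from true and predicted labels using only the standard library. The function name confusion_matrix and the example labels are hypothetical; libraries such as scikit-learn provide their own implementations.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where rows are true classes and columns are predictions."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


# Hypothetical true labels and model predictions.
y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]
matrix = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
for label, row in zip(["cat", "dog"], matrix):
    print(label, row)
```

The diagonal holds the correct predictions, and the off-diagonal cells show which classes the model confuses with each other.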

4. Identifying trends

Time and seasonal plots are useful in time series analysis to identify
certain trends over time.
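
One simple way to make a trend stand out in a time plot is to smooth the series before plotting it. The sketch below assumes a simple moving average, which is only one of several smoothing options; the function name moving_average and the sales figures are hypothetical.

```python
def moving_average(series, window):
    """Return the simple moving average over a sliding window."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical monthly sales with a seasonal wobble around a rising trend.
sales = [10, 14, 9, 13, 17, 12, 16, 20, 15]
print(moving_average(sales, window=3))
```

With a window equal to the length of the seasonal cycle, the wobble averages out and the smoothed values rise steadily.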

5. Presenting results

As a data scientist, you need to present your findings to the company or
other stakeholders who may not have much knowledge of the subject
domain. So, you need to explain everything in plain English. You can
use informative plots that summarize your findings.

Different Types of Data Visualization


There are many data visualization types. The following are the
commonly used data visualization charts.

1. Distribution plot

A distribution plot visualizes how the values of a variable are
distributed. Examples are a probability distribution plot and a density
curve.

Source: seaborn.pydata.org

2. Box and whisker plot

This plot shows the variation in the values of a numerical feature.
You can read off the minimum, maximum, median, and lower and upper
quartiles of the values.
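
The five statistics behind a box and whisker plot can be computed directly. This is a minimal sketch using the standard library; the function name five_number_summary is hypothetical, and the "inclusive" quartile method is an assumption, since plotting libraries may use slightly different quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return the five statistics a box and whisker plot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical values of a numerical feature.
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```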

3. Violin plot

Similar to the box and whisker plot, the violin plot shows the variation
of a numerical feature. But it contains a kernel density curve in
addition to the box plot. The kernel density curve estimates the
underlying distribution of the data.

Source: seaborn.pydata

4. Line plot

A line plot is created by connecting a series of data points with straight
lines. Typically, the time periods are on the x-axis.

5. Bar plot

A bar plot shows how often each value of a categorical variable occurs.
Each category is represented by a bar. The bars can be drawn
vertically or horizontally. Their heights or lengths are proportional to
the values they represent.

6. Scatter plot

Scatter plots are created to see whether there is a relationship (linear or
non-linear and positive or negative) between two numerical variables.
They are commonly used in regression analysis.

7. Histogram

A histogram represents the distribution of numerical data. Looking at a
histogram, we can judge whether the values are roughly normally
distributed (a bell-shaped curve), skewed to the right or skewed to the
left. A histogram of residuals is useful for validating important
assumptions in regression analysis.
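
A histogram like the one described above might be drawn as follows. This is a sketch that assumes matplotlib is installed; the synthetic normal sample, the choice of 20 bins and the output file name histogram.png are all illustrative assumptions.

```python
import random

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the script runs without a display.
import matplotlib.pyplot as plt

# Hypothetical sample: 1,000 draws from a normal distribution.
random.seed(0)
values = [random.gauss(50, 10) for _ in range(1000)]

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges along with the drawn bars.
counts, bin_edges, _ = ax.hist(values, bins=20, edgecolor="black")
ax.set_title("Distribution of a numerical feature")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
fig.savefig("histogram.png")
```

The number of bins changes how the shape reads, so it is worth trying a few values before drawing conclusions about normality or skew.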

8. Pie chart

A pie chart for a categorical variable shows each category as a slice
whose size is proportional to the quantity it represents. It is a
circular graph with as many slices as there are categories.

9. Area plot

The area plot is based on the line chart. We get the area plot when we
fill the area between the line and the x-axis.

Source: python-graph-gallery.com

10. Hexbin plot

Similar to the scatter plot, a hexbin plot represents the relationship
between two numerical variables. It is useful when there are a lot of
data points, because such points overlap when represented in a scatter
plot. The hexbin plot instead groups them into hexagonal bins and colors
each bin by the number of points it contains.

Source: python-graph-gallery.com

11. Heatmap

A heatmap visualizes the correlation coefficients of numerical features
with a color map. Depending on the color map, light colors can show a
high correlation while dark colors show a low correlation. The heatmap
is extremely useful for identifying multicollinearity, which occurs when
an input feature is highly correlated with one or more of the other
features in the dataset.
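
A correlation heatmap might be sketched as follows. The example assumes NumPy and matplotlib are installed and uses made-up data in which x2 is almost a copy of x1; the variable names, the coolwarm color map and the output file name heatmap.png are illustrative choices.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the script runs without a display.
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical dataset: x2 is nearly a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
names = ["x1", "x2", "x3"]

# Rows are variables, so corrcoef returns a 3 x 3 correlation matrix.
corr = np.corrcoef([x1, x2, x3])

fig, ax = plt.subplots()
image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(list(range(len(names))))
ax.set_xticklabels(names)
ax.set_yticks(list(range(len(names))))
ax.set_yticklabels(names)
fig.colorbar(image, ax=ax, label="Pearson correlation")
ax.set_title("Correlation heatmap")
fig.savefig("heatmap.png")
```

In the resulting image, the x1/x2 cell stands out with a correlation close to 1, which is the kind of pattern that points to multicollinearity.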

Data Visualization Process/Workflow

The data visualization process or workflow includes the following key
steps.

1. Develop your research question

This may be a business problem or any other related problem that could
be solved with a data-driven approach. You should note all the
objectives and outcomes plus required resources such as datasets,
open-source software libraries, etc.

2. Get or create your data

The next step is collecting data. You can use existing datasets if they’re
relevant to your research question. Alternatively, you can
download open-source datasets from the internet or do web scraping to
collect data.

3. Clean your data


Real-world data are messy. So, you need to clean them before using
them for visualization. You can identify missing values and outliers and
treat them accordingly. You can perform feature selection and remove
unnecessary features from the data. You can create a new set of
features based on the original features.

4. Choose a chart type

The chart type depends on many factors. For example, it depends on
the feature type (numerical or categorical). It also depends on the type
of visualization you need. Let’s say you have two numerical features. If
you want to see their distributions, you can create a histogram for
each feature. If you want to plot their variations, you can create a box
and whisker plot for each feature. You can create a scatterplot if you
want to find a relationship (linear or non-linear, positive or negative)
between the two features.
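
The reasoning above can be written down as a small lookup table. The sketch below is only an illustration of that idea: the function name suggest_chart, the goal names and the rule table are hypothetical and cover just the cases mentioned in this document, not every chart type.

```python
def suggest_chart(feature_types, goal):
    """Suggest a chart for the given feature types and analysis goal.

    feature_types: tuple of "numerical" / "categorical", one entry per feature.
    goal: "distribution", "variation", "relationship" or "frequency".
    """
    rules = {
        (("numerical",), "distribution"): "histogram",
        (("numerical",), "variation"): "box and whisker plot",
        (("numerical", "numerical"), "relationship"): "scatter plot",
        (("categorical",), "frequency"): "bar plot",
    }
    return rules.get((tuple(feature_types), goal), "no suggestion")


print(suggest_chart(("numerical", "numerical"), "relationship"))
print(suggest_chart(("categorical",), "frequency"))
```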

5. Choose your tool

You can use open-source data visualization tools such as matplotlib,
seaborn, Plotly and ggplot2. You can also use commercial software such
as Matlab, Minitab, SPSS, etc.

6. Prepare data

You can extract relevant features. You can do feature standardization if
the values of the features are not on the same scale. You can apply data
preprocessing steps such as PCA to reduce the dimensionality of the
data. That will allow you to visualize high-dimensional data in 2D and
3D plots!
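
Feature standardization can be sketched in a few lines. The example below assumes z-score standardization (subtract the mean, divide by the population standard deviation), which is one common choice; the function name standardize and the two features are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


# Hypothetical features on very different scales.
height_cm = [150, 160, 170, 180, 190]
income = [20_000, 35_000, 50_000, 65_000, 80_000]
print(standardize(height_cm))
print(standardize(income))
```

After standardization, both features are expressed in standard deviations from their own mean, so they can share the same axes or be fed to PCA without one dominating the other.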

7. Create a chart

This is the final step. Here, you define the title and the names of the
axes. You should also choose a proper chart background to ensure the
content is easily readable.
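
As a sketch of this final step, the example below draws a line plot and sets the title, axis names and background explicitly. It assumes matplotlib is installed; the revenue figures, labels and output file name line_plot.png are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so the script runs without a display.
import matplotlib.pyplot as plt

# Hypothetical monthly values.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12, 15, 14, 18, 21, 24]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousand USD)")
ax.set_facecolor("white")  # Plain background keeps the line easy to read.
ax.grid(True, alpha=0.3)
fig.tight_layout()
fig.savefig("line_plot.png")
```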

Tools and Software for Data Visualization

There are multiple tools and software available for data visualization.

1. Python provides open-source libraries such as

• Matplotlib
• Seaborn
• Plotly
• Bokeh
• Altair

2. R provides open-source libraries such as

• ggplot2
• Lattice

3. Other data visualization software

• IBM SPSS
• Minitab
• Matlab
• Tableau
• Microsoft Power BI

Tableau and Microsoft Power BI are popular among data scientists.

Data Visualization Techniques in Data Science


Some of the main data visualization techniques in data science are
univariate analysis, bivariate analysis and multivariate analysis.

1. Univariate Analysis

In univariate analysis, as the name suggests, we analyze only one
variable at a time. In other words, we analyze each variable separately.
Bar charts, pie charts, box plots and histograms are common examples
of univariate data visualization. Bar charts and pie charts are created for
categorical variables, while box plots and histograms are created for
numerical variables.

2. Bivariate Analysis

In bivariate analysis, we analyze two variables at a time. Often, we
check whether there is a relationship between the two variables. The
scatter plot is a classic example of bivariate data visualization.

3. Multivariate Analysis

In multivariate analysis, we analyze more than two variables
simultaneously. The heatmap is a classic example of multivariate data
visualization. Other examples are cluster analysis and principal
component analysis (PCA).

Advantages and Disadvantages of Data Visualization


Advantages

There are many advantages of data visualization. Data visualization is
used to:

• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions

Disadvantages

There are also some disadvantages of data visualization.

• We need to download, install and configure software and open-source
libraries. The process can be difficult and time-consuming for
beginners.
• Some data visualization tools are not available for free. We need to
pay for those.
• When we summarize the data, we lose the exact values.

Examples of Data Visualization in Data Science

Here are some popular data visualization examples.

1. Weather reports: Maps and other plot types are commonly used in
weather reports.
2. Internet websites: Social media analytics websites such as Social
Blade and Google Analytics use data visualization techniques to
analyze and compare the performance of websites.
3. Astronomy: NASA uses advanced data visualization techniques in
its reports and presentations.
4. Geography
5. Gaming industry

Data Visualization Best Practices


1. Set the context

We need to develop a research question that could be solved with a
data-driven approach.

2. Know your audience

This is very important as the visualizations depend on the type of
audience you have. To present your findings to a business audience,
you need to create visualizations closely related to money, profits and
revenue, the terms that business people are familiar with!

3. Choose an effective visual

You need to create the right plot that addresses your requirement. To
see the correlations between multiple variables, you can create a
scatterplot for each pair of variables. But that is not very effective.
Instead, you can create a heatmap, which is an effective way of
visualizing correlations. When you have many categories, the pie chart
is not suitable. Instead, you can create a bar chart. These are some
examples of choosing an effective visual for your requirements.

4. Keep it simple

Simple plots are easy to read. We can remove unnecessary
backgrounds to make things stand out. We should not include too much
content in the plot. A title, axis labels, scales and a legend are
usually enough.

Essential Skills for Data Visualization


Effective data visualization requires the following skills.

1. Programming

You should know the R or Python language. R wins, hands down, when it
comes to data visualization. Its ggplot2 library provides high-level
functions to make complex plots with less code. Data visualization in
Python can be done using libraries like matplotlib, Plotly, Bokeh and
seaborn. Plotly and Bokeh can be used for interactive data
visualizations.

2. Software Expertise

In addition to using R or Python, you can also use data visualization
software such as Matlab, Minitab and SPSS. Data visualization in Excel
is also popular. However, these tools provide limited customization for
your plots. In addition, you cannot automate the plot creation process
as easily as you can with Python or R.

3. Data Science Skills

Data visualization is one of the data science skills. But, for effective data
visualization, you need other data science skills such as statistical
analysis, data cleaning, processing large data sets, data mining, etc.
Data visualization cannot be done in isolation. It draws on all of these
skills.

4. Public Speaking and Presentation

When it comes to presenting your findings to the company or other
stakeholders, you need to have excellent presentation skills. You
should be confident when explaining things to a larger audience. For
that, you should be familiar with the given problem domain.

5. Machine Learning

Machine learning is the ability of computers to learn from data without
being explicitly programmed. It is completely different from traditional
programming. We can use machine learning algorithms to find important
patterns and features in the data. Then, we can visualize those things.
There are also machine learning algorithms that can be used to perform
data cleaning before data visualization. In this way, machine learning
can be part of the data visualization process.

Conclusion

Data visualization is important in every aspect of data science. We
should clean our data before making any visualization. We should
choose the right tool or software that addresses our needs, such as
affordability, ease of use, etc. The main challenge in data visualization
is choosing the right plot type. It depends on many factors. Finally, you
need excellent public speaking and presentation skills to present your
findings.

Today, we discussed data visualization applications and methods in
detail with examples. Learning data visualization is not straightforward.
You should master many skills for that.

Frequently Asked Questions (FAQs)


1. What are the three main goals of data visualization?

• Communicating your results or findings with your audience
• Exploring (knowing) your data
• Identifying trends, patterns and correlations between variables

2. How is data visualization used in data science?

Data visualization is used in every aspect of data science:

• Tuning hyperparameters
• Monitoring the model’s performance
• Cleaning data
• Validating the model’s assumptions

3. What are the major challenges of data visualization?

• Choosing the right plot type
• Identifying the needs of your audience
• Developing the research question and converting it to a data science
question
• Collecting data

4. What are the benefits of data visualization?

Common use cases of data visualization include:

• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions
