Eds Unit 3

Uploaded by

Adhiban R

COS 2402-ESSENTIALS OF

DATA SCIENCE
Unit III

Unit III-Data Modeling and Exploration

Data Modeling

Data Exploration: Introduction

Data Visualization Process/Workflow

The data visualization process or workflow includes the following key
steps.

1. Develop your research question

This may be a business problem or any other problem that could be
solved with a data-driven approach. You should note all the
objectives and outcomes, plus the required resources such as datasets
and open-source software libraries.

2. Get or create your data

The next step is collecting data. You can use existing datasets if
they’re relevant to your research question. Alternatively, you can
download open-source datasets from the internet or do web scraping to
collect data.

3. Clean your data

Real-world data are messy. So, you need to clean them before using
them for visualization. You can identify missing values and outliers
and treat them accordingly. You can perform feature selection and
remove unnecessary features from the data. You can create a new
set of features based on the original features.
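
The cleaning step above can be sketched as follows. This is a minimal
illustration, assuming missing values are stored as None, that filling
with the median is acceptable, and that outliers are dropped with the
common 1.5 x IQR rule. The function name clean_column and the sample
ages are hypothetical; other treatments may suit your data better.

```python
import statistics


def clean_column(values):
    """Fill missing values (None) with the median, then drop IQR outliers."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]

    # Quartiles of the filled column; points beyond 1.5 * IQR from the
    # quartiles are treated as outliers (an assumed rule of thumb).
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in filled if low <= v <= high]


# Made-up sample: one missing value and one implausible age.
ages = [23, 25, None, 27, 24, 26, 120, 25]
cleaned_ages = clean_column(ages)
print(cleaned_ages)
```

Dropping outliers is only one option. Depending on the research
question, you may prefer to keep them and investigate them instead.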

4. Choose a chart type

The chart type depends on several factors, such as the feature type
(numerical or categorical) and what you want to show. Let’s say you
have two numerical features. If you want to see their distributions,
you can create a histogram for each feature. If you want to compare
their spread, you can create a box and whisker plot for each feature.
If you want to find a relationship (linear or non-linear, positive or
negative) between the two features, you can create a scatterplot.
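
The rules of thumb in this step can be written down as a small lookup.
This is only a sketch for numerical features: the function name
suggest_chart and the goal labels are hypothetical, and real chart
selection involves more judgment than three rules.

```python
def suggest_chart(goal, n_numerical_features):
    """Suggest a chart for numerical features based on the analysis goal."""
    if goal == "distribution":
        return "histogram (one per feature)"
    if goal == "spread":
        return "box and whisker plot (one per feature)"
    if goal == "relationship" and n_numerical_features == 2:
        return "scatterplot"
    return "no rule in this sketch"


print(suggest_chart("distribution", 2))
print(suggest_chart("relationship", 2))
```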

5. Choose your tool

You can use open-source data visualization libraries such as
matplotlib, seaborn, plotly and ggplot. You can also use commercial
software such as Matlab, Minitab, SPSS, etc.

6. Prepare data

You can extract relevant features. You can do feature standardization
if the values of the features are not on the same scale. You can apply
data preprocessing steps such as PCA to reduce the dimensionality
of the data. That will allow you to visualize high-dimensional data in
2D and 3D plots!
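
Feature standardization, as mentioned above, can be sketched with the
standard library alone. The example assumes z-score standardization
(subtract the mean, divide by the population standard deviation) and
that no feature is constant. The function name standardize and the
sample values are made up for illustration.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]


# Two made-up features on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 30000.0, 40000.0, 50000.0, 60000.0]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)
print(z_heights)
print(z_incomes)
```

After standardization both features are on the same scale, so neither
dominates a plot or a technique such as PCA simply because of its
units.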

7. Create a chart

This is the final step. Here, you define the title and the labels for
the axes. You should also choose a proper chart background to ensure
the content is easily readable.
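
As one possible illustration of this step, the sketch below draws a
scatterplot with matplotlib and sets the title, axis labels and a
plain white background. The data values and the output file name
scatter.png are made up, and matplotlib is assumed to be installed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Made-up data for illustration only.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 68, 74, 79]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(hours_studied, exam_score)

# Title and axis labels tell the reader what is plotted.
ax.set_title("Exam score vs. hours studied")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")

# A plain background and a light grid keep the points easy to read.
ax.set_facecolor("white")
ax.grid(True, alpha=0.3)

fig.savefig("scatter.png", dpi=100)
```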

Data Visualization Techniques in Data Science

Some of the main data visualization techniques in data science are
univariate analysis, bivariate analysis and multivariate analysis.

1. Univariate Analysis

In univariate analysis, as the name suggests, we analyze only one
variable at a time. In other words, we analyze each variable
separately. Bar charts, pie charts, box plots and histograms are
common examples of univariate data visualization. Bar charts and pie
charts are created for categorical variables, while box plots and
histograms are created for numerical variables.

2. Bivariate Analysis

In bivariate analysis, we analyze two variables at a time. Often, we
see whether there is a relationship between the two variables. The
scatter plot is a classic example of bivariate data visualization.
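
A scatter plot is often paired with a number that summarizes the
relationship. The sketch below computes the Pearson correlation
coefficient, one common choice that measures only linear
relationships. The function name pearson_r and the sample values are
hypothetical, and the inputs are assumed to be non-constant.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: the first pair rises together, the second moves oppositely.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```

A value near +1 or -1 suggests a strong positive or negative linear
relationship, while a value near 0 suggests no linear relationship.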

3. Multivariate Analysis

In multivariate analysis, we analyze more than two variables
simultaneously. The heatmap is a classic example of multivariate
data visualization. Other multivariate techniques whose results are
often visualized are cluster analysis and principal component
analysis (PCA).
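
A correlation heatmap is one common form of the heatmap mentioned
above. The sketch below assumes numpy and matplotlib are installed
and uses synthetic data. The variable names, the random seed and the
output file name heatmap.png are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: weight is built to depend on height, age is independent.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)
weight = 0.9 * height + rng.normal(0, 5, 200)
age = rng.normal(40, 12, 200)
names = ["height", "weight", "age"]

# np.corrcoef treats each row as one variable.
corr = np.corrcoef(np.vstack([height, weight, age]))

fig, ax = plt.subplots()
image = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.set_title("Correlation heatmap")
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.savefig("heatmap.png")
```

Each cell shows the correlation between one pair of variables, so all
pairwise relationships can be compared in a single chart.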

Advantages and Disadvantages of Data Visualization

Advantages

There are many advantages of data visualization. Data visualization
is used to:

• Communicate your results or findings with your audience
• Tune hyperparameters
• Identify trends, patterns and correlations between variables
• Monitor the model’s performance
• Clean data
• Validate the model’s assumptions
Disadvantages

There are also some disadvantages of data visualization.

• We need to download, install and configure software and
open-source libraries. The process will be difficult and
time-consuming for beginners.
• Some data visualization tools are not available for free. We
need to pay for those.
• When we summarize the data in a chart, we lose detail: the exact
values are no longer visible.

Importance of data visualization

Data visualization provides a quick and effective way to communicate information in a
universal manner using visual information. The practice can also help businesses identify
which factors affect customer behavior; pinpoint areas that need to be improved or need more
attention; make data more memorable for stakeholders; understand when and where to place
specific products; and predict sales volumes.

Other benefits of data visualization include the following:

• the ability to absorb information quickly, improve insights and make faster decisions;

• an increased understanding of the next steps that must be taken to improve the organization;

• an improved ability to maintain the audience's interest with information they can
understand;

• an easy distribution of information that increases the opportunity to share insights with
everyone involved;

• a reduced reliance on data scientists for routine questions, since data is more accessible
and understandable; and

• an increased ability to act on findings quickly and, therefore, achieve success with greater
speed and fewer mistakes.

Data visualization and big data

The increased popularity of big data and data analysis projects has made visualization more
important than ever. Companies are increasingly using machine learning to gather massive
amounts of data that can be difficult and slow to sort through, comprehend and explain.
Visualization offers a means to speed this up and present information to business owners and
stakeholders in ways they can understand.

Big data visualization often goes beyond the typical techniques used in normal visualization,
such as pie charts, histograms and corporate graphs. It instead uses more complex
representations, such as heat maps and fever charts. Big data visualization requires powerful
computer systems to collect raw data, process it and turn it into graphical representations that
humans can use to quickly draw insights.

While big data visualization can be beneficial, it can pose several disadvantages to
organizations. They are as follows:

• To get the most out of big data visualization tools, a visualization specialist must be hired.
This specialist must be able to identify the best data sets and visualization styles to
guarantee organizations are optimizing the use of their data.

• Big data visualization projects often require involvement from IT, as well as management,
since the visualization of big data requires powerful computer hardware, efficient storage
systems and even a move to the cloud.

• The insights provided by big data visualization will only be as accurate as the information
being visualized. Therefore, it is essential to have people and processes in place to govern
and control the quality of corporate data, metadata and data sources.
