
Module 1: Data Visualization and Data Exploration

Prepared by Dr. Ganesha Prasad, Dept. of AI & ML


Module 1: Data Visualization and
Data Exploration
• Introduction: Data Visualization, Importance of Data
Visualization, Data Wrangling, Tools and Libraries for
Visualization
• Overview of Statistics: Measures of Central Tendency,
Measures of Dispersion, Correlation, Types of Data, Summary
Statistics
• Numpy: Numpy Operations - Indexing, Slicing, Splitting,
Iterating, Filtering, Sorting, Combining, and Reshaping
• Pandas: Advantages of pandas over numpy, Disadvantages
of pandas, Pandas operations - Indexing, Slicing, Iterating,
Filtering, Sorting and Reshaping using Pandas
Data Visualization
• Data visualization is the graphical representation of
information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and
patterns in data.
• Data visualization translates complex data sets into visual
formats that are easier for the human brain to comprehend.
This can include a variety of visual tools such as:
• Charts: Bar charts, line charts, pie charts, etc.
• Graphs: Scatter plots, histograms, etc.
• Maps: Geographic maps, heat maps, etc.
Tools for Visualization of Data
• The following are ten popular Data Visualization Tools:
1.Tableau
2.Looker
3.Zoho Analytics
4.Sisense
5.IBM Cognos Analytics
6.Qlik Sense
7.Domo
8.Microsoft Power BI
9.Klipfolio
10.SAP Analytics Cloud
Advantages of Data
Visualization:
• Enhanced Comparison: Visualizing performances of two elements or scenarios
streamlines analysis, saving time compared to traditional data examination.
• Improved Methodology: Representing data graphically offers a superior
understanding of situations, exemplified by tools like Google Trends illustrating
industry trends in graphical forms.
• Efficient Data Sharing: Visual data presentation facilitates effective
communication, making information more digestible and engaging compared to
sharing raw data.
• Sales Analysis: Data visualization aids sales professionals in comprehending
product sales trends, identifying influencing factors through tools like heat maps, and
understanding customer types, geography impacts, and repeat customer behaviors.
• Identifying Event Relations: Discovering correlations between events helps
businesses understand external factors affecting their performance, such as online
sales surges during festive seasons.
• Exploring Opportunities and Trends: Data visualization empowers business
leaders to uncover patterns and opportunities within vast datasets, enabling a deeper
understanding of customer behaviors and insights into emerging business trends.
Disadvantages of Data Visualization:
• Can be time-consuming: Creating visualizations can be a time-consuming
process, especially when dealing with large and complex datasets.
• Can be misleading: While data visualization can help identify patterns
and relationships in data, it can also be misleading if not done correctly.
Visualizations can create the impression of patterns or trends that may
not exist, leading to incorrect conclusions and poor decision-making.
• Can be difficult to interpret: Some types of visualizations, such as those
that involve 3D or interactive elements, can be difficult to interpret and
understand.
• May not be suitable for all types of data: Certain types of data, such as
text or audio data, may not lend themselves well to visualization. In
these cases, alternative methods of analysis may be more appropriate.
• May not be accessible to all users: Some users may have visual
impairments or other disabilities that make it difficult or impossible for
them to interpret visualizations. In these cases, alternative methods of
presenting data may be necessary to ensure accessibility.
Applications of Data Visualization

• Business Intelligence and Reporting


• Financial Analysis
• Healthcare
• Marketing and Sales
• Human Resources



Data Wrangling
• Data wrangling is the process of transforming raw data into a suitable
representation for various tasks. It is the discipline of augmenting,
cleaning, filtering, standardizing, and enriching data in a way that allows
it to be used in a downstream task, which in our case is data
visualization.
• Data Wrangling is also known as Data Munging.

• Example: A book-selling website wants to show the top-selling books of
different domains, according to user preference. For example, if a new
user searches for motivational books, the site wants to show the
motivational books that sell the most or have a high rating.
• On the website, however, there is plenty of raw data from different users.
Here the concept of Data Munging, or Data Wrangling, is used. Data
wrangling is not done by the system itself; it is done by data scientists.
The data scientist wrangles the data so that the motivational books that
sell the most, have high ratings, or are frequently bought together with
other books can be surfaced. On that basis, the new user can make a
choice. This illustrates the importance of data wrangling.
Tools and Libraries for Visualization
Commonly used tools are:
• Non-coding tools – Tableau, Power BI
• Coding tools – Python, MATLAB, and R

Note: the libraries will be discussed in detail in the coming chapters.
Overview of Statistics
• Statistics is a combination of the analysis, collection,
interpretation, and representation of numerical data.
• Probability is a measure of the likelihood that an event will
occur and is quantified as a number between 0 and 1.
• A probability distribution is a function that provides the
probability for every possible event. A probability distribution is
frequently used for statistical analysis. The higher the
probability, the more likely the event. There are two types of
probability distributions, namely:
Discrete
Continuous.



Discrete probability distribution
A discrete probability distribution
shows all the values that a random
variable can take, together with their
probability. The following diagram
illustrates an example of a discrete
probability distribution. If we have a six-
sided die, we can roll each number
between 1 and 6. We have six events
that can occur based on the number
that's rolled. There is an equal
probability of rolling any of the
numbers, and the individual probability
of any of the six events occurring is 1/6:
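The fair-die example above can be sketched in Python (a minimal illustration; the fractions module is used only to keep the 1/6 probabilities exact):

```python
from fractions import Fraction

# fair six-sided die: every outcome has the same probability of 1/6
outcomes = range(1, 7)
pmf = {k: Fraction(1, 6) for k in outcomes}

# the probabilities of a distribution always sum to 1
total = sum(pmf.values())
```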
Continuous probability
distribution
• A continuous probability
distribution defines the probabilities
of each possible value of a continuous
random variable. The following
diagram provides an example of a
continuous probability distribution.
This example illustrates the
distribution of the time needed to
drive home. In most cases, around 60
minutes is needed, but sometimes,
less time is needed because there is
no traffic, and sometimes, much more
time is needed if there are traffic jams:
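The drive-home example can be simulated as a continuous (normal) distribution; the 60-minute mean and 10-minute spread below are illustrative assumptions, not values from the slide:

```python
import numpy as np

# drive-home times modelled as a normal distribution around 60 minutes
rng = np.random.default_rng(seed=42)
times = rng.normal(loc=60, scale=10, size=10_000)

# for a large sample, the empirical mean is close to 60 minutes
avg = times.mean()
```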
Measures of Central Tendency
• Mean: The arithmetic average is computed by summing up all
measurements and dividing the sum by the number of observations:
mean = (x₁ + x₂ + … + xₙ) / n.

• Median: This is the middle value of the ordered dataset. If there is
an even number of observations, the median will be the average of
the two middle values. The median is less prone to outliers compared
to the mean, where outliers are distinct values in data.

• Mode: Our last measure of central tendency, the mode is defined as
the most frequent value. There may be more than one mode in cases
where multiple values are equally frequent.
Example
• For example, a die was rolled 10 times, and we got the following
numbers: 4, 5, 4, 3, 4, 2, 1, 1, 2, and 1.

The mean is calculated by summing all the events and dividing


them by the number of observations:
(4+5+4+3+4+2+1+1+2+1)/10=2.7.

To calculate the median, the die rolls have to be ordered according


to their values. The ordered values are as follows: 1, 1, 1, 2, 2, 3, 4,
4, 4, 5. Since we have an even number of die rolls, we need to take
the average of the two middle values. The average of the two middle
values is (2+3)/2=2.5.

The modes are 1 and 4, since they are the two most frequent events.
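The same die-roll example can be checked with NumPy (a minimal sketch; NumPy itself has no mode function, so collections.Counter is used to find the modes):

```python
import numpy as np
from collections import Counter

rolls = np.array([4, 5, 4, 3, 4, 2, 1, 1, 2, 1])

mean = rolls.mean()        # (4+5+4+3+4+2+1+1+2+1)/10 = 2.7
median = np.median(rolls)  # average of the two middle values = 2.5

# count frequencies; every value tied for the highest count is a mode
counts = Counter(rolls.tolist())
top = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == top)
```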
Measures of Dispersion
• Dispersion, also called variability, is the extent to which a
probability distribution is stretched or squeezed. The different
measures of dispersion are as follows:
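As an illustration of common dispersion measures (range, population variance, standard deviation, and interquartile range), here is a minimal NumPy sketch on invented sample data:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

rng = data.max() - data.min()        # range: 9 - 2 = 7
var = data.var()                     # population variance
std = data.std()                     # population standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                        # interquartile range
```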



Correlation
Correlation describes the statistical relationship
between two variables:
• In a positive correlation, both variables move in the
same direction.
• In a negative correlation, the variables move in
opposite directions.
• In zero correlation, the variables are not related.
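A quick NumPy sketch of positive and negative correlation, using np.corrcoef on made-up data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = 2 * x     # moves in the same direction as x  -> correlation +1
y_neg = -2 * x    # moves in the opposite direction   -> correlation -1

r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
```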

Types of Data



NumPy
• It provides support for large n-dimensional arrays and has built-
in support for many high-level mathematical and statistical
operations.
• Mean

• Median



• Var, std
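A minimal sketch of these NumPy statistics on a small invented array:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

m = np.mean(arr)                   # mean over all elements: 3.5
med = np.median(arr)               # 3.5
v = np.var(arr)                    # population variance
s = np.std(arr)                    # square root of the variance
col_means = np.mean(arr, axis=0)   # per-column means: [2.5, 3.5, 4.5]
```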



Basic NumPy Operations
Indexing
• Indexing elements in a NumPy array, at a high level, works the
same as with built-in Python lists. Therefore, we can index
elements in multi-dimensional matrices:
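For example (a small invented matrix):

```python
import numpy as np

mat = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

first_row = mat[0]        # [1, 2, 3]
center = mat[1, 1]        # row 1, column 1 -> 5
last_value = mat[-1, -1]  # negative indices count from the end -> 9
```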



• Slicing: Being able to easily slice parts of lists into new
ndarrays is very helpful when handling large amounts of
data
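A minimal slicing sketch on invented data:

```python
import numpy as np

mat = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

first_two_rows = mat[:2]      # rows 0 and 1
middle_cols = mat[:, 1:3]     # columns 1 and 2 of every row
every_other = mat[0, ::2]     # every second element of the first row
```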



Splitting
• Splitting data can be helpful in many situations, from plotting
only half of your timeseries data to separating test and training
data for machine learning algorithms. There are two ways of
splitting your data, horizontally and vertically. Horizontal
splitting can be done with the hsplit method. Vertical splitting
can be done with the vsplit method:

• Horizontal split: splits along columns (axis 1).
• Vertical split: splits along rows (axis 0).
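A minimal sketch of hsplit and vsplit:

```python
import numpy as np

mat = np.arange(16).reshape(4, 4)

left, right = np.hsplit(mat, 2)   # split along columns (axis 1)
top, bottom = np.vsplit(mat, 2)   # split along rows (axis 0)
```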



Iterating
• Iterating the NumPy data structures, ndarrays, is also possible.
It steps over the whole list of data one after another, visiting
every single element in the ndarray once. Considering that they
can have several dimensions, indexing gets very complex. The
nditer is a multi-dimensional iterator object that iterates over a
given number of arrays:

• ndenumerate additionally yields the multi-dimensional index of each
element alongside its value.
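A minimal sketch of nditer and ndenumerate:

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4]])

# nditer visits every element of the ndarray exactly once
flat = [int(x) for x in np.nditer(mat)]

# ndenumerate also yields each element's multi-dimensional index
indexed = [(idx, int(x)) for idx, x in np.ndenumerate(mat)]
```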



Filtering

• Filtering is a very powerful tool that can be used to


clean up your data if you want to avoid outlier values.
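A minimal sketch of boolean-mask filtering; the outlier threshold of 100 is an arbitrary assumption:

```python
import numpy as np

data = np.array([1, 200, 3, 4, 500, 6])

mask = data < 100    # boolean mask marking the non-outlier values
clean = data[mask]   # keeps only the elements where the mask is True
```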



Sorting
• Sorting each row of a dataset can be really useful. Using
NumPy, we are also able to sort on other dimensions, such as
columns.
• In addition, argsort gives us the possibility to get a list of
indices, which would result in a sorted list:
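A minimal sketch of np.sort and np.argsort:

```python
import numpy as np

mat = np.array([[3, 1, 2],
                [9, 7, 8]])

by_row = np.sort(mat)          # sorts each row (the last axis)
by_col = np.sort(mat, axis=0)  # sorts each column instead
order = np.argsort(mat[0])     # indices that would sort the first row
```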



Combining
• Stacking rows and columns onto an existing dataset can be
helpful when you have two datasets of the same dimension
saved to different files.
• Given two datasets, we use vstack to stack dataset_1 on top
of dataset_2, which will give us a combined dataset with all
the rows from dataset_1, followed by all the rows from
dataset_2.
• If we use hstack, we stack our datasets "next to each other,"
meaning that the elements from the first row of dataset_1
will be followed by the elements of the first row of dataset_2.
This will be applied to each row:



Reshaping
• Reshaping can be crucial for some algorithms. Depending
on the nature of your data, it might help you to reduce
dimensionality to make visualization easier:
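A minimal reshaping sketch:

```python
import numpy as np

data = np.arange(12)       # one dimension, shape (12,)

grid = data.reshape(3, 4)  # 3 rows x 4 columns
auto = data.reshape(2, -1) # -1 lets NumPy infer the second dimension
flat = grid.reshape(-1)    # back to one dimension
```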



Pandas
• The pandas Python library provides data structures and methods
for manipulating different types of data, such as numerical and
temporal data. These operations are easy to use and highly
optimized for performance. Data formats, such as CSV and
JSON, and databases can be used to create DataFrames.
• DataFrames are the internal representations of data and are
very similar to tables but are more powerful since they allow you
to efficiently apply operations such as multiplications,
aggregations, and even joins. Importing and reading both
files and in-memory data is abstracted into a user-friendly
interface. When it comes to handling missing data, pandas provides
built-in solutions to clean up and augment your data, for example by
filling in missing values with reasonable defaults.
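A minimal sketch of building a DataFrame from in-memory data and filling a missing value (the column names and the mean-fill strategy are illustrative assumptions):

```python
import pandas as pd

# DataFrames can be built from in-memory data as well as CSV/JSON files
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "sales":   [10.0, None, 30.0],   # one missing value
})

# one built-in way to handle missing data: fill with the column mean
df["sales"] = df["sales"].fillna(df["sales"].mean())
```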



• Integrated indexing and label-based slicing in combination
with fancy indexing (what we already saw with NumPy) make
handling data simple. More complex techniques, such as
reshaping, pivoting, and melting data, together with the
possibility of easily joining and merging data, provide
powerful tooling so that you can handle your data correctly.



Basic Operations of pandas
Indexing:
• Indexing with pandas is a bit more complex than with
NumPy. We can only access columns with a single bracket. To
use the indices of the rows to access them, we need the iloc
method. If we want to access them with index_col (which
was set in the read_csv call), we need to use the loc method:
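A minimal sketch of column access, iloc, and loc; the labels here stand in for an index_col that would normally be set during read_csv:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 32, 41], "score": [88, 92, 79]},
    index=["ann", "bob", "cara"],   # plays the role of index_col
)

col = df["age"]             # single brackets select a column
row_by_pos = df.iloc[0]     # positional (integer) row access
row_by_lbl = df.loc["bob"]  # label-based row access
```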



Series

• A pandas Series is a one-dimensional labeled array that is
capable of holding any type of data. We can create a Series by
loading datasets from a .csv file, an Excel spreadsheet, or a SQL
database.
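For example, a Series can also be created directly from in-memory data (the values and labels below are invented):

```python
import pandas as pd

# a one-dimensional labeled array: values plus an index of labels
s = pd.Series([4, 8, 15], index=["a", "b", "c"], name="values")
```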



Advanced pandas Operations
• Filtering
Filtering in pandas has a higher-level interface than NumPy. You
can still use simple bracket-based conditional filtering. However,
you are also able to run more complex queries, for example filtering
rows by label similarity: the like argument searches for a substring,
and the regex argument accepts full regular expressions:
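A minimal sketch of bracket-based filtering plus DataFrame.filter with like and regex (the row labels are invented):

```python
import pandas as pd

df = pd.DataFrame(
    {"v": [1, 2, 3]},
    index=["apple pie", "apple tart", "banana bread"],
)

cond = df[df["v"] > 1]                   # bracket-based condition
by_sub = df.filter(like="apple", axis=0) # labels containing "apple"
by_re = df.filter(regex="bread$", axis=0)  # labels matching a regex
```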



Sorting
• Sorting each row or column based on a given row or
column will help you analyze your data better and find
the ranking of a given dataset. With pandas, we are
able to do this pretty easily. Sorting in ascending and
descending order can be done using the parameter
known as ascending. The default sorting order is
ascending. Of course, you can do more complex sorting
by providing more than one value in the by = [ ] list.
Those will then be used to sort values for which the first
value is the same:
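A minimal sketch of sort_values, including multi-column sorting with by=[...]:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["B", "A", "A"],
    "sales": [5, 9, 2],
})

asc = df.sort_values(by="sales")                  # ascending is the default
desc = df.sort_values(by="sales", ascending=False)
multi = df.sort_values(by=["city", "sales"])      # ties on city broken by sales
```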



Reshaping
• Reshaping can be crucial for easier visualization and
algorithms. However, depending on your data, this can get
really complex:
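A minimal sketch of reshaping between long and wide form with pivot and melt (invented data):

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "city": ["A", "B", "A", "B"],
    "temp": [20, 25, 21, 26],
})

# long -> wide: one row per date, one column per city
wide = long.pivot(index="date", columns="city", values="temp")

# wide -> long again via melt
back = wide.reset_index().melt(id_vars="date", value_name="temp")
```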

