Intro to Data Visualization
Python Version
What is Data Visualization For?
Use Cases
● Exploratory Data Analysis: Find out important features and identify any anomalies
● Benchmarking: Generating reports to present to stakeholders/team
● Evaluating Business Problems: Identify any areas of improvement by looking at data
● Dashboarding: Keeps track of KPIs in multiple important business functions
Machine Learning Pipeline
01 Data Preparation & Exploratory Data Analysis
02 Feature Engineering & Feature Selection
03 Model Selection & Testing
04 Hyper-parameter Tuning & Overall Evaluation
05 Deployment
Basic Packages for Data Visualization
Data Manipulation:
pandas
Cool Charts:
Matplotlib
Seaborn (nicer graphs)
Interactive Charts:
Plotly
Why do EDA?
● Sanity Checks to look for missing data, outliers, redundant
information, etc
● Understand the distribution of the data and their relationships
● Look out for data leakage!
● Start Feature Selection
Installing/Importing Packages
Installing Packages
!pip install numpy==1.18.5
● Run this line of code to install numpy
● Packages must be installed before you can import them
● You can specify the version you want to install, in this case,
v1.18.5
Importing
import XX as YY
● XX refers to the package you want to import
● YY is the alias (short name) you will use to refer to the package in your code
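As a small illustration of the import pattern above, the sketch below uses the aliases np and pd. These are widely used conventions rather than requirements, and the array is made-up example data.

```python
# XX is the package name, YY is the short alias used afterwards.
import numpy as np
import pandas as pd

# Printing the version is a quick way to confirm which version got installed.
print(np.__version__)
print(pd.__version__)

# The alias is used wherever the full package name would otherwise appear.
arr = np.array([1, 2, 3])
total = int(arr.sum())
print(total)
```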
Pandas Functions
pd.read_csv()
Reading CSV file
● Data is often stored in the form of CSV
● Allows us to read the CSV, and put it into a tabular format, stored as a dataframe
df.head() & df.tail()
Reading top & bottom rows
● Allows you to do a sanity check for the first few or last few rows of your dataframe
● Can specify the number of rows you want, default set to 5 rows
● e.g. df.head(10) for first 10 rows
● e.g. df.tail(10) for last 10 rows
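A minimal sketch of reading a CSV and checking its first and last rows. The CSV content and column names here are hypothetical and are read from an in-memory string so the example is self-contained; in practice you would pass a file path to pd.read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV content; normally this would live in a file on disk.
csv_text = """Customer ID,Age,Gender,Season
1,55,Male,Winter
2,19,Female,Spring
3,50,Female,Summer
4,21,Male,Winter
5,43,Male,Fall
6,27,Female,Spring
"""

# read_csv accepts a path or any file-like object and returns a dataframe.
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)  # first 2 rows (default is 5)
last_two = df.tail(2)   # last 2 rows (default is 5)
print(first_two)
print(last_two)
```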
df.info()
● Gives us a quick summary of all the columns: names, non-null counts, and data types
● How do we deal with Null values if any?
df.describe()
● Gives us a quick summary of all the numerical columns (count, mean, standard deviation, min, quartiles, max)
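The two summaries can be tried on a tiny dataframe. This is a sketch with made-up data; df.isnull().sum() is added here as one common way to count the null values that df.info() hints at.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "Age": [55, np.nan, 50, 21],
    "Payment Method": ["Venmo", "Cash", None, "PayPal"],
})

# Prints column names, non-null counts and dtypes.
df.info()

# By default only numerical columns are summarised.
summary = df.describe()
print(summary)

# Number of null values per column.
null_counts = df.isnull().sum()
print(null_counts)
```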
Value Counts
df['Gender'].value_counts()
● Counts how many rows fall into each category of a categorical variable
Unique Values
df['Season'].unique()
● Obtain all the unique values in a column
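A short sketch of the column-level checks above, using made-up Gender and Season values. nunique() is an extra call, not on the slide, that returns just the number of distinct values.

```python
import pandas as pd

# Made-up categorical data.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Season": ["Winter", "Spring", "Winter", "Summer"],
})

# Number of rows per category, most frequent first.
gender_counts = df["Gender"].value_counts()
print(gender_counts)

# Distinct values, in order of first appearance.
seasons = df["Season"].unique()
print(seasons)

# Just the number of distinct values.
n_seasons = df["Season"].nunique()
print(n_seasons)
```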
Dropping Columns
df.drop(['Gender'], axis = 1, inplace = True)
● axis=0 drops rows (by index label), axis=1 drops columns (by name)
● inplace=True modifies df directly instead of returning a new dataframe
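The sketch below shows the same drop without inplace=True, which returns a new dataframe and leaves the original untouched. The data is made up, and the columns= keyword is shown as an equivalent spelling of axis=1.

```python
import pandas as pd

# Made-up data.
df = pd.DataFrame({
    "Customer ID": [1, 2],
    "Gender": ["Male", "Female"],
    "Age": [55, 21],
})

# Without inplace=True, drop() returns a new dataframe; df is unchanged.
df_no_gender = df.drop(["Gender"], axis=1)

# Equivalent call that names the axis explicitly.
df_no_gender_alt = df.drop(columns=["Gender"])

# axis=0 drops rows by index label instead.
df_no_first_row = df.drop([0], axis=0)

print(df_no_gender)
print(df_no_first_row)
```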
Dropping NA Values
df.dropna()
● Remove all rows with null or NaN (Not a Number) values
● Any row with a null/NaN value in any column will be dropped
Before:
Customer ID   Age   Payment Method
1             55    Venmo
2             NaN   Cash
3             50    NaN
4             21    PayPal

After df.dropna():
Customer ID   Age   Payment Method
1             55    Venmo
4             21    PayPal
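The example table can be reproduced in a few lines. This is a sketch; the name df_simple is borrowed from the fillna example in this deck, and the subset= variant is an extra showing how to drop rows based on a single column only.

```python
import numpy as np
import pandas as pd

# The four-row example table, with NaN marking missing values.
df_simple = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Age": [55, np.nan, 50, 21],
    "Payment Method": ["Venmo", "Cash", np.nan, "PayPal"],
})

# Drops every row that has a null in any column (customers 2 and 3).
df_clean = df_simple.dropna()
print(df_clean)

# Drops rows only when Age is null (customer 2).
df_age_only = df_simple.dropna(subset=["Age"])
print(df_age_only)
```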
Filling NA Values
df.fillna()
● Replace null values with a specified value (one value overall, or one per column)
● e.g., df_filled = df_simple.fillna({'Age': 0, 'Payment Method': 'Unknown'})
Before:
Customer ID   Age   Payment Method
1             55    Venmo
2             NaN   Cash
3             50    NaN
4             21    PayPal

After df_simple.fillna({'Age': 0, 'Payment Method': 'Unknown'}):
Customer ID   Age   Payment Method
1             55    Venmo
2             0     Cash
3             50    Unknown
4             21    PayPal
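A runnable sketch of the fill shown in the table. The median fill at the end is an added alternative, not part of the slide, for cases where 0 would distort the Age column.

```python
import numpy as np
import pandas as pd

# The four-row example table, with NaN marking missing values.
df_simple = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Age": [55, np.nan, 50, 21],
    "Payment Method": ["Venmo", "Cash", np.nan, "PayPal"],
})

# A dict gives each column its own fill value.
df_filled = df_simple.fillna({"Age": 0, "Payment Method": "Unknown"})
print(df_filled)

# Alternative: fill Age with the median of the observed ages.
age_median = df_simple["Age"].median()
df_median = df_simple.fillna({"Age": age_median})
print(df_median)
```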
Sorting Values
df['Season'].value_counts().sort_values()
● Sorts the values of a column (here, the count for each season)
● Smallest to largest by default
● Use ascending=False to sort largest first
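A small sketch of both sort directions on made-up Season data:

```python
import pandas as pd

# Made-up data: Winter appears 3 times, Spring 2, Summer 1.
df = pd.DataFrame({
    "Season": ["Winter", "Spring", "Winter", "Summer", "Winter", "Spring"],
})

counts = df["Season"].value_counts()

# Smallest count first (the default).
smallest_first = counts.sort_values()
print(smallest_first)

# Largest count first.
largest_first = counts.sort_values(ascending=False)
print(largest_first)
```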
Datetime Manipulation
df['year_month'] = pd.to_datetime(df['order_date'])
● Converts a column (order_date) to datetime format
● Required for time-series analysis
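A sketch of what the conversion enables, using made-up order data. Deriving a monthly period with dt.to_period("M") and grouping by it is one possible follow-up step, assumed here for illustration rather than taken from the slides.

```python
import pandas as pd

# Made-up orders; dates start out as plain strings.
df = pd.DataFrame({
    "order_date": ["2023-01-15", "2023-01-30", "2023-02-03"],
    "amount": [20, 35, 50],
})

# Convert the strings to datetime values.
df["order_date"] = pd.to_datetime(df["order_date"])

# Datetime columns expose the .dt accessor, e.g. to extract the year-month.
df["year_month"] = df["order_date"].dt.to_period("M")

# Total amount per month.
monthly_total = df.groupby("year_month")["amount"].sum()
print(monthly_total)
```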
Charts & Plots
Types of Charts
Categorical Variables
- Bar charts
- Pie charts
- Stacked bar charts
- Radar charts
Numerical Data Visualization
- Scatter plots
- Line charts
- Histograms
- Box plots
Time-Series Data
- Time-series line charts
- Area charts
- Gantt charts
Geospatial Data
- Choropleth map
- Heatmap
Others
- Correlation Matrix
- Waterfall chart
- Many more!
Pie Charts
plt.pie(gender_counts, labels=gender_counts.index,
colors=colors, autopct='%1.1f%%')
● Useful when there are only a few categories
● Very intuitive
● Used to quickly check for class imbalances /
category imbalances
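A self-contained sketch of the pie chart call. The data and the colors list are made up, and the Agg backend is selected only so the script runs without a display; in a notebook you would skip that line and call plt.show().

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: 3 Male, 1 Female.
df = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Male"]})
gender_counts = df["Gender"].value_counts()

# Hypothetical colour choices, one per category.
colors = ["skyblue", "lightcoral"]

fig, ax = plt.subplots()
# autopct formats each wedge's share as a percentage with one decimal place.
wedges, texts, autotexts = ax.pie(
    gender_counts,
    labels=gender_counts.index,
    colors=colors,
    autopct="%1.1f%%",
)
ax.set_title("Gender split")

percent_labels = [t.get_text() for t in autotexts]
print(percent_labels)
```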
Box Plots
sns.boxplot(x=col, y=counts)
● Quick overview of data distribution of all numerical features
● Can easily identify outliers
● Question: What do we do if we see any extreme outliers here?
[Figure: box plot annotated with its range, Q1 and Q3 quartiles, median, and outliers]
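By default, matplotlib and seaborn box plots extend the whiskers to at most 1.5 times the interquartile range (IQR) beyond Q1 and Q3 and draw anything further out as an outlier. The sketch below applies the same rule numerically to made-up ages.

```python
import pandas as pd

# Made-up ages with one extreme value.
ages = pd.Series([21, 25, 30, 32, 35, 38, 40, 95], name="Age")

q1 = ages.quantile(0.25)  # first quartile
q3 = ages.quantile(0.75)  # third quartile
iqr = q3 - q1             # interquartile range

# Points beyond these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(outliers)
```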
Bar Plots
sns.barplot(data=df, x='Category', y='Purchase Amount (USD)', errorbar=None, palette="pastel")
● Can accommodate more categories than a pie chart
● Very intuitive, and can carry a lot of extra detail, e.g., counts
● Used to check for distribution of categories
Correlation Matrix Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, linewidths=0.5, square = True)
● Looks at the pairwise correlation between all numerical variables
● Extremely important to understand collinearity
● Useful to determine whether we need to remove highly correlated variables
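A sketch of how a correlation matrix like the one passed to sns.heatmap can be computed and scanned for highly correlated pairs. The data is made up, and the 0.9 threshold is an arbitrary assumption for illustration.

```python
import pandas as pd

# Made-up data; Income is an exact linear function of Age.
df = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Income": [2000, 3000, 4000, 5000],
    "Purchases": [5, 3, 6, 2],
    "Category": ["A", "B", "A", "B"],
})

# Keep numerical columns only, then compute pairwise Pearson correlations.
correlation_matrix = df.select_dtypes("number").corr()
print(correlation_matrix)

# List each pair whose absolute correlation exceeds a hypothetical threshold.
threshold = 0.9
cols = list(correlation_matrix.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(correlation_matrix.loc[a, b]) > threshold
]
print(high_pairs)
```

One column from each flagged pair is a candidate for removal before modelling.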
Density Plot
sns.kdeplot(age_data, fill=True, legend=False)
● Look at the density distribution of continuous variables
● Useful for checking if the distribution is normal
● Easy to look for extreme outliers (e.g., high density of age > 80)
Choropleth Map
merged_data.plot(column='count', cmap='coolwarm', linewidth=0.8, ax=ax, edgecolor='0.8')
● Easily the coolest map (+10 style points)
● I added this map for fun
● Very useful in showing data distribution for geographical data
Data Cleaning
Dealing with NA Values
df.dropna()
● Easiest method, just remove unclean data!
● What if NA values make up >20% of the data?
df[column].fillna()
● Fill NA values with something else
● Mean? Median?
● What about for categorical values?
What if data is not missing at random?
Dealing with NA Values
● In this example, all females’ Group is unknown
○ Data is not missing at random
○ Fill with ‘Unknown’
● What about Salary?
○ Impute with mean/median
○ Fill with 0
Gender Group Salary
Male A 50
Female ? 100
Male A ?
Female ? 50
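The example table can be worked through as follows. This is a sketch: the missing-rate check per gender is one simple way to spot data that is not missing at random, and the median is just one of the Salary options listed above.

```python
import numpy as np
import pandas as pd

# The example table, with NaN in place of "?".
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Group": ["A", np.nan, "A", np.nan],
    "Salary": [50, 100, np.nan, 50],
})

# Share of missing Group values per gender; 1.0 means always missing.
missing_by_gender = df["Group"].isnull().groupby(df["Gender"]).mean()
print(missing_by_gender)

# Group is not missing at random, so keep that signal as its own category.
df["Group"] = df["Group"].fillna("Unknown")

# Salary: impute with the median of the observed salaries.
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
print(df)
```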
Dealing with Extreme Values
Extreme outliers can skew models!
Some outliers may even be due to human error
Some data inherently will be extremely skewed: Salary, Age
df.drop()
● Also the easiest method: just remove extreme outliers!
● What if it is not a normal distribution?
What if extreme data points tell us something important?
E.g., using salary to predict spending
Binning Numerical Data
Changing numerical features to categorical features
● Create groups for extreme values
○ <20: Age Group 0-20
○ >60: Age Group 61+
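One way to sketch this binning is with pd.cut. The ages are made up, and the middle "21-60" group, the upper bound of 120, and treating age 20 as part of the 0-20 group are assumptions made for the example.

```python
import pandas as pd

# Made-up ages.
ages = pd.Series([15, 20, 21, 45, 60, 61, 80])

# Bin edges and labels; the middle group and the 120 cap are assumptions.
bins = [0, 20, 60, 120]
labels = ["0-20", "21-60", "61+"]

# Intervals are right-inclusive: [0, 20], (20, 60], (60, 120].
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(age_group)
```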
Learning from Documentation
Seaborn API: https://seaborn.pydata.org/api.html
● Built-in examples of how you can use their plot methods
● Detailed explanation of customisations/parameters
Exercise