Lesson 1 - Data Visualisation
Data Visualization
Python Version
What is Data Visualization For?
Use Cases
● Exploratory Data Analysis: find important features and identify any anomalies
● Benchmarking: generate reports to present to stakeholders/team
● Evaluating Business Problems: identify areas of improvement by looking at data
● Dashboarding: keep track of KPIs in multiple important business functions
Machine Learning Pipeline
[Figure: machine learning pipeline diagram; the only stage label recovered is 05, Deployment]
Basic Packages for Data Visualization
Data Manipulation:
pandas
Cool Charts:
Matplotlib
Seaborn (nicer graphs)
Interactive Charts:
Plotly
Why do EDA?
import XX as YY
● XX refers to the package you want to import
● YY is the alias you will use to refer to the package in your code
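As a small illustration of the import pattern above, here are the packages from the earlier slide with the aliases the community conventionally uses. The aliases are a convention, not a requirement, and the sketch assumes all three packages are installed.

```python
import pandas as pd               # XX = pandas, YY = pd
import matplotlib.pyplot as plt   # XX = matplotlib.pyplot, YY = plt
import seaborn as sns             # XX = seaborn, YY = sns

# After the import, the alias is the name you type.
df = pd.DataFrame({"Age": [55, 50, 21]})
print(df.shape)
```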
Pandas
Functions
pd.read_csv()
df.info()
● Gives us a quick summary of all the columns
● How do we deal with Null values if any?
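A minimal sketch of loading data and checking for nulls. The CSV text and its column names are made up for illustration; with a real dataset you would pass a file path to pd.read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a real file on disk.
csv_text = """Age,Gender,Payment Method
55,Male,Venmo
50,Female,
21,Male,PayPal
"""

df = pd.read_csv(io.StringIO(csv_text))  # a file path works the same way
df.info()                                # column names, non-null counts, dtypes

# Count missing values per column to decide how to handle them.
null_counts = df.isna().sum()
print(null_counts)
```

Columns with a non-zero count are the ones to look at when deciding between dropping and filling.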
df.describe()
● Gives us a quick summary of all the numerical columns
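A short sketch of what df.describe() returns, using a tiny made-up DataFrame (the values are illustrative only). By default only the numerical columns are summarised, so the text column is left out.

```python
import pandas as pd

# Illustrative data, not taken from the course dataset.
df = pd.DataFrame({
    "Age": [55, 50, 21, 34],
    "Purchase Amount (USD)": [53.0, 64.0, 73.0, 90.0],
    "Gender": ["Male", "Female", "Male", "Female"],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)
```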
Value Counts
df['Gender'].describe()
● Gives us a quick summary of the categorical variable
Unique Values
df['Season'].unique()
● Obtain all the unique values in a column
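The three ways of inspecting a categorical column mentioned so far can be compared side by side. This is a sketch on made-up data; only the 'Season' column name comes from the slides.

```python
import pandas as pd

# Illustrative data for a single categorical column.
df = pd.DataFrame(
    {"Season": ["Winter", "Spring", "Winter", "Fall", "Winter", "Spring"]}
)

unique_seasons = df["Season"].unique()   # distinct values, in order of first appearance
counts = df["Season"].value_counts()     # how many rows fall in each category
cat_summary = df["Season"].describe()    # count, unique, top, freq

print(unique_seasons)
print(counts)
print(cat_summary)
```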
Dropping Rows with Null Values
df.dropna()
● Remove all rows with null or NaN (Not a Number) values
● Any row with a null/NaN value in any column will be dropped
Before:
3 50 NaN
4 21 PayPal
After:
4 21 PayPal
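A minimal sketch of dropna() on rows like the ones in the slide. The column names 'Age' and 'Payment Method' are assumed from the fillna example elsewhere in the deck.

```python
import numpy as np
import pandas as pd

# Small frame modelled on the slide's rows; column names are assumptions.
df_simple = pd.DataFrame(
    {"Age": [55, 50, 21], "Payment Method": ["Venmo", np.nan, "PayPal"]},
    index=[1, 3, 4],
)

df_dropped = df_simple.dropna()  # returns a new frame; df_simple is unchanged
print(df_dropped)
```

Row 3 disappears because one of its columns is null, even though its 'Age' value is valid.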
Filling NA Values
df.fillna()
● Replace null values with a specified value (one per column if needed)
● e.g., df_filled = df_simple.fillna({'Age': 0, 'Payment Method': 'Unknown'})
Before                  After
1   55   Venmo          1   55   Venmo
3   50   NaN            3   50   Unknown
4   21   PayPal         4   21   PayPal
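The fillna call from the slide can be run end to end on the same three rows. The DataFrame construction is an assumption made for illustration; the fillna arguments are the ones shown above.

```python
import numpy as np
import pandas as pd

# Rows modelled on the slide's before/after example.
df_simple = pd.DataFrame(
    {"Age": [55, 50, 21], "Payment Method": ["Venmo", np.nan, "PayPal"]},
    index=[1, 3, 4],
)

# A dict maps each column to its own replacement value.
df_filled = df_simple.fillna({"Age": 0, "Payment Method": "Unknown"})
print(df_filled)
```

Unlike dropna(), every row is kept, so no valid 'Age' values are lost.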
Sorting Values
df['Season'].value_counts().sort_values()
● Sorts the values of a column
● Smallest to Largest by default
● Use ascending=False to sort largest first
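A quick sketch of both sort directions on made-up 'Season' data. Note that value_counts() on its own already returns counts from largest to smallest, so sort_values() is mainly needed to flip or re-establish the order.

```python
import pandas as pd

# Illustrative data; only the column name comes from the slides.
df = pd.DataFrame(
    {"Season": ["Winter", "Spring", "Winter", "Fall", "Winter", "Spring"]}
)

counts = df["Season"].value_counts()
smallest_first = counts.sort_values()                # default: ascending=True
largest_first = counts.sort_values(ascending=False)  # largest category on top

print(smallest_first)
print(largest_first)
```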
Datetime Manipulation
df['year_month'] = pd.to_datetime(df['order_date'])
● Converts a column (order_date) to datetime format
● Required for time-series analysis
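A sketch of the conversion and one possible follow-up step. The dates are made up, and the assumption here is that 'year_month' is meant to hold a monthly bucket, which is one common reason to convert to datetime first.

```python
import pandas as pd

# Illustrative order dates stored as plain strings.
df = pd.DataFrame({"order_date": ["2023-01-15", "2023-02-03", "2023-02-20"]})

df["order_date"] = pd.to_datetime(df["order_date"])    # strings -> datetime64
df["year_month"] = df["order_date"].dt.to_period("M")  # assumed intent: monthly bucket

orders_per_month = df.groupby("year_month").size()     # simple time-series aggregate
print(orders_per_month)
```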
Charts & Plots
Types of Charts
plt.pie(gender_counts, labels=gender_counts.index, colors=colors, autopct='%1.1f%%')
● Useful when there are only a few categories
● Very intuitive
● Used to quickly check for class imbalances / category imbalances
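A runnable sketch of the pie chart call, filling in the pieces the slide leaves out. The data and colour values are made up, and the Agg backend is selected only so the script runs without a display; in a notebook you would drop that line and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative, deliberately imbalanced data.
df = pd.DataFrame({"Gender": ["Male", "Male", "Male", "Female"]})
gender_counts = df["Gender"].value_counts()
colors = ["#8ecae6", "#ffb703"]  # hypothetical palette

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    gender_counts,
    labels=gender_counts.index,
    colors=colors,
    autopct="%1.1f%%",  # print each slice's share with one decimal place
)
ax.set_title("Gender split")
```

The percentage labels make the imbalance between the two categories visible at a glance.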
Box Plots
sns.boxplot(x=col, y=counts)
● Quick overview of data distribution of all numerical features
● Can easily identify outliers
● Question: What do we do if we see any extreme outliers here?
[Figure: annotated box plot labelling the range, Q1 (first quartile), median, Q3 (third quartile), and outliers]
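One common way to flag the points a box plot draws as outliers is the 1.5 × IQR rule, which matches the default whisker length in Matplotlib and Seaborn box plots. The sketch below applies it to made-up values; whether to remove, cap, or keep such points depends on the data and is not settled by the rule itself.

```python
import pandas as pd

# Illustrative values with one extreme point.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range (height of the box)

lower = q1 - 1.5 * iqr      # below this, a point counts as an outlier
upper = q3 + 1.5 * iqr      # above this, a point counts as an outlier

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```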
Bar Plots
sns.barplot(data=df, x='Category', y='Purchase Amount (USD)', errorbar=None, palette="pastel")
● Can accommodate more categories than a pie chart
● Very intuitive, and can carry extra details, e.g., counts
● Used to check for distribution of categories
Correlation Matrix Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, linewidths=0.5, square = True)
● Looks at correlation between all numerical values
● Extremely important to understand collinearity
● Useful to determine whether we need to remove highly correlated variables
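A sketch of how the correlation_matrix passed to the heatmap might be built, and how highly correlated pairs could be listed. The data, the 'Previous Purchases' column, and the 0.9 threshold are all assumptions for illustration.

```python
import pandas as pd

# Illustrative data; 'Age' and 'Purchase Amount (USD)' are perfectly correlated here.
df = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Purchase Amount (USD)": [10.0, 20.0, 30.0, 40.0],
    "Previous Purchases": [9, 1, 7, 3],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Keep numerical columns only, then compute pairwise Pearson correlations.
correlation_matrix = df.select_dtypes("number").corr()

# List each pair once if its absolute correlation exceeds the threshold.
threshold = 0.9  # hypothetical cut-off
cols = list(correlation_matrix.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(correlation_matrix.loc[a, b]) > threshold
]
print(high_pairs)
```

The resulting correlation_matrix is what you would hand to sns.heatmap(); the pair list is a candidate set of variables to review, not an automatic removal list.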
Density Plot
sns.kdeplot(age_data, fill=True, legend=False)
● Look at the density distribution of continuous variables
● Useful for checking if the distribution is normal
● Easy to look for extreme outliers (e.g., high density of age > 80)
Choropleth Map
merged_data.plot(column='count', cmap='coolwarm', linewidth=0.8, ax=ax, edgecolor='0.8')
● Easily the coolest map (+10 style points)
● I added this map for fun
● Very useful in showing data distribution for geographical data
Data Cleaning
Dealing with NA Values
df.dropna()
df[column].fillna()
● Fill NA values with something else
● Mean? Median?
● What about for categorical values?
Male A 50
Female ? 100
Male A ?
Female ? 50
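One possible answer to the questions above, sketched on a table shaped like the example: fill the numerical column with its median and the categorical column with its most frequent value (the mode). The column names 'Group' and 'Amount' are hypothetical, since the slide does not name them.

```python
import numpy as np
import pandas as pd

# Same shape as the slide's table; "?" cells are represented as missing values.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Group": ["A", None, "A", None],     # hypothetical column name
    "Amount": [50, 100, np.nan, 50],     # hypothetical column name
})

# Numerical column: the median is less sensitive to extreme values than the mean.
df["Amount"] = df["Amount"].fillna(df["Amount"].median())

# Categorical column: fill with the most frequent category.
df["Group"] = df["Group"].fillna(df["Group"].mode()[0])

print(df)
```

Filling with the mode makes every row 'A' here, which shows the risk of this approach; a placeholder such as 'Unknown' is an alternative when the missingness itself may be informative.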
Dealing with Extreme Values