
Intro to Data Visualization
Python Version
What is Data Visualization For?

Exploratory Data Analysis
Find out important features and identify any anomalies

Benchmarking
Generate reports to present to stakeholders/team

Use Cases

Evaluating Business Problems
Identify any areas of improvement by looking at the data

Dashboarding
Keep track of KPIs in multiple important business functions
Machine Learning Pipeline

01. Data Preparation & Exploratory Data Analysis
02. Feature Engineering & Feature Selection
03. Model Selection & Testing
04. Hyper-parameter Tuning & Overall Evaluation
05. Deployment
Basic Packages for Data Visualization

Data Manipulation:
pandas

Cool Charts:
Matplotlib
Seaborn (nicer graphs)

Interactive Charts:
Plotly
Why do EDA?

● Sanity checks to look for missing data, outliers, redundant information, etc.
● Understand the distribution of the data and their relationships
● Look out for data leakage!
● Start Feature Selection
Installing/Importing Packages
Installing Packages

!pip install numpy==1.18.5

● Run this line of code to install numpy
● Packages must be installed before you can import them
● You can specify the version you want to install, in this case v1.18.5
Importing

import XX as YY
● XX refers to the package you want to import
● YY is the alias you will use to refer to the package
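For example, the conventional aliases for the packages used in this lesson:

import pandas as pd              # data manipulation
import matplotlib.pyplot as plt  # basic charts
import seaborn as sns            # nicer graphs
import plotly.express as px      # interactive charts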
Pandas Functions
pd.read_csv()

Reading CSV file

● Data is often stored in the form of CSV
● Allows us to read the CSV and put it into a tabular format, stored as a DataFrame
● Link to CSV here
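A minimal sketch; the filename is a placeholder, not the actual CSV linked above:

import pandas as pd

# Read the CSV into a DataFrame (a tabular data structure)
df = pd.read_csv("shopping.csv")  # hypothetical filename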
df.head() & df.tail()

Reading top & bottom rows

● Allows you to do a sanity check on the first few or last few rows of your dataframe
● Can specify the number of rows you want; the default is 5 rows
● e.g. df.head(10) for the first 10 rows
● e.g. df.tail(10) for the last 10 rows
df.info()

df.info()
● Gives us a quick summary of all the columns
● How do we deal with Null values if any?
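One common way to start answering that question is to count the nulls per column (a sketch, assuming df is the DataFrame loaded earlier):

df.info()                  # dtypes, non-null counts, memory usage
print(df.isnull().sum())   # number of null values in each column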
df.describe()

df.describe()
● Gives us a quick summary of all the numerical columns
Value Counts

df['Gender'].value_counts()
● Gives us the count of each category in a categorical variable
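A minimal sketch, assuming the Gender column from this lesson's dataset:

print(df['Gender'].value_counts())                 # count of each category
print(df['Gender'].value_counts(normalize=True))   # proportions instead of counts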
Unique Values

df['Season'].unique()
● Obtain all the unique values in a column
Dropping Columns

df.drop(['Gender'], axis = 1, inplace = True)


● Axis = 0 represents Row, Axis = 1 represents Column
● inplace = True to mutate (and save) the data
Dropping NA Values

df.dropna()
● Remove all rows with null or NaN (Not a Number) values
● Any row with a null/NaN value in any column will be dropped

Before:

Customer ID | Age | Payment Method
1           | 55  | Venmo
2           | NaN | Cash
3           | 50  | NaN
4           | 21  | PayPal

After df.dropna():

Customer ID | Age | Payment Method
1           | 55  | Venmo
4           | 21  | PayPal
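A minimal sketch reproducing the example above (df_simple matches the fillna slide that follows):

import pandas as pd
import numpy as np

df_simple = pd.DataFrame({
    'Customer ID': [1, 2, 3, 4],
    'Age': [55, np.nan, 50, 21],
    'Payment Method': ['Venmo', 'Cash', np.nan, 'PayPal'],
})
# Rows 2 and 3 each contain a NaN, so both are dropped
df_clean = df_simple.dropna()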
Filling NA Values

df.fillna()
● Fills all null values with a specified value
● e.g., df_filled = df_simple.fillna({'Age': 0, 'Payment Method': 'Unknown'})

Before:

Customer ID | Age | Payment Method
1           | 55  | Venmo
2           | NaN | Cash
3           | 50  | NaN
4           | 21  | PayPal

After fillna:

Customer ID | Age | Payment Method
1           | 55  | Venmo
2           | 0   | Cash
3           | 50  | Unknown
4           | 21  | PayPal
Sorting Values

df['Season'].value_counts().sort_values()
● Sorts the values of a column
● Smallest to Largest by default
● Use ascending=False to sort largest first
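For example:

# Categories from most to least frequent
df['Season'].value_counts().sort_values(ascending=False)

Note that value_counts() already returns counts in descending order; the explicit sort just makes the intent clear.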
Datetime Manipulation

df['year_month'] = pd.to_datetime(df['order_date'])
● Converts a column (order_date) to datetime format
● Required for time-series analysis
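A minimal sketch; deriving an actual year-month value with .dt.to_period('M') is an assumption about the intent behind the year_month name, since pd.to_datetime alone only parses the dates:

df['order_date'] = pd.to_datetime(df['order_date'])     # parse strings to datetimes
df['year_month'] = df['order_date'].dt.to_period('M')   # e.g. 2023-07 (assumed intent)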
Charts & Plots
Types of Charts

Categorical Variables:
- Bar charts
- Pie charts
- Stacked bar charts
- Radar charts

Numerical Data:
- Scatter plots
- Line charts
- Histograms
- Box plots

Time-Series Data:
- Time-series line charts
- Area charts
- Gantt charts

Geospatial Data:
- Choropleth map
- Heatmap

Others:
- Correlation matrix
- Waterfall chart
- Many more!
Pie Charts

plt.pie(gender_counts, labels=gender_counts.index, colors=colors, autopct='%1.1f%%')
● Useful when there are only a few categories
● Very intuitive
● Used to quickly check for class imbalances / category imbalances
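A runnable sketch; gender_counts and colors are built here for illustration and are not defined on the slide:

import matplotlib.pyplot as plt

gender_counts = df['Gender'].value_counts()
colors = ['#66b3ff', '#ff9999']   # illustrative, one colour per category
plt.pie(gender_counts, labels=gender_counts.index, colors=colors, autopct='%1.1f%%')
plt.title('Gender Split')
plt.show()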
Box Plots
sns.boxplot(x=col, y=counts)
● Quick overview of data distribution of all numerical features
● Can easily identify outliers
● Question: What do we do if we see any extreme outliers here?
[Diagram: box plot anatomy, showing Q1, median, Q3, the range (whiskers), and outliers]
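A minimal sketch for a single numerical column (the slide's x=col, y=counts arguments depend on variables defined elsewhere in the notebook):

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['Purchase Amount (USD)'])   # column name taken from the bar plot slide
plt.show()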
Bar Plots

sns.barplot(data=df, x='Category', y='Purchase Amount (USD)', errorbar=None, palette="pastel")
● Can accommodate more categories
● Very intuitive; can add in a lot of other details, e.g., counts
● Used to check the distribution of categories
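The call above is complete as written; a runnable wrapper might look like this. errorbar=None (which replaced the older ci=None in seaborn 0.12) suppresses the confidence-interval bars:

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(data=df, x='Category', y='Purchase Amount (USD)', errorbar=None, palette='pastel')
plt.xticks(rotation=45)   # keep long category labels readable
plt.show()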
Correlation Matrix Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, linewidths=0.5, square = True)
● Looks at correlation between all numerical values
● Extremely important to understand collinearity
● Useful to determine whether we need to remove highly correlated variables
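A minimal sketch; correlation_matrix and cmap are built here for illustration:

import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.select_dtypes('number').corr()   # numeric columns only
cmap = sns.diverging_palette(230, 20, as_cmap=True)      # illustrative palette
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, linewidths=0.5, square=True)
plt.show()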
Density Plot
sns.kdeplot(age_data, fill=True, legend=False)
● Look at the density distribution of continuous variables
● Useful for checking if the distribution is normal
● Easy to look for extreme outliers (e.g., high density of age > 80)
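A minimal sketch, assuming an Age column; age_data is built here for illustration:

import seaborn as sns
import matplotlib.pyplot as plt

age_data = df['Age'].dropna()
sns.kdeplot(age_data, fill=True, legend=False)
plt.show()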
Choropleth Map
merged_data.plot(column='count', cmap='coolwarm', linewidth=0.8, ax=ax, edgecolor='0.8')
● Easily the coolest map (+10 style points)
● I added this map for fun
● Very useful in showing data distribution for geographical data
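A sketch of how merged_data might be assembled with geopandas; the shapefile path, the Location column, and the NAME join key are all placeholders, not from the lesson:

import geopandas as gpd
import matplotlib.pyplot as plt

world = gpd.read_file('countries.shp')                                # placeholder path
counts = df['Location'].value_counts().rename('count')                # placeholder column
merged_data = world.merge(counts, left_on='NAME', right_index=True)   # placeholder key

fig, ax = plt.subplots(figsize=(12, 6))
merged_data.plot(column='count', cmap='coolwarm', linewidth=0.8, ax=ax, edgecolor='0.8')
plt.show()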
Data Cleaning
Dealing with NA Values

df.dropna()

● Easiest method, just remove unclean data!


● What if NA values make up >20% of the data?

df[column].fillna()
● Fill NA values with something else
● Mean? Median?
● What about for categorical values?

What if data is not missing at random?


Dealing with NA Values

● In this example, all females’ Group is unknown


○ Data is not missing at random
○ Fill with ‘Unknown’

● What about Salary?


○ Impute with mean/median
○ Fill with 0

Gender | Group | Salary
Male   | A     | 50
Female | ?     | 100
Male   | A     | ?
Female | ?     | 50
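A sketch of both strategies, using the column names from the table above (median is one reasonable choice, not the slide's prescription):

df['Group'] = df['Group'].fillna('Unknown')                 # not missing at random
df['Salary'] = df['Salary'].fillna(df['Salary'].median())   # impute a central value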
Dealing with Extreme Values

Extreme outliers can skew models!


Some outliers may even be due to human error
Some data inherently will be extremely skewed: Salary, Age
df.drop()

● Also the easiest method: just remove the extreme outliers!

● What if the data is not normally distributed?

What if extreme data points tell us something important?


E.g., using salary to predict spending
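One common way to identify and drop extreme outliers is the 1.5 * IQR rule; this is an illustrative technique the slide does not name, applied here to the Salary example:

q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
# Keep only rows within 1.5 * IQR of the quartiles
mask = df['Salary'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[mask]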
Binning Numerical Data

Changing numerical features to categorical features


● Create groups for extreme values
○ <20: Age Group 0-20
○ >60: Age Group 61+
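A sketch with pd.cut; the middle bin edge at 40 is an assumption, only the <20 and >60 cut-offs come from the slide:

import pandas as pd

bins = [0, 20, 40, 60, float('inf')]   # 40 is an assumed middle edge
labels = ['0-20', '21-40', '41-60', '61+']
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)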
Learning from Documentation

SEABORN API: https://seaborn.pydata.org/api.html

● Built-in examples of how you can use their plot methods
● Detailed explanation of customisations/parameters
Exercise
