
Intro to Data Visualization
Python Version
What is Data Visualization For?

Exploratory Data Analysis
Find out important features and identify any anomalies

Benchmarking
Generate reports to present to stakeholders/team

Use Cases

Evaluating Business Problems
Identify any areas of improvement by looking at the data

Dashboarding
Keep track of KPIs in multiple important business functions
Machine Learning Pipeline

01. Data Preparation & Exploratory Data Analysis
02. Feature Engineering & Feature Selection
03. Model Selection & Testing
04. Hyper-parameter Tuning & Overall Evaluation
05. Deployment
Basic Packages for Data Visualization

Data Manipulation:
pandas

Cool Charts:
Matplotlib
Seaborn (nicer graphs)

Interactive Charts:
Plotly
Why do EDA?

● Sanity checks to look for missing data, outliers, redundant information, etc.
● Understand the distribution of the data and their relationships
● Look out for data leakage!
● Start Feature Selection
Installing/Importing Packages
Installing Packages

!pip install numpy==1.18.5

● Run this line of code to install numpy
● Packages must be installed before you can import them
● You can specify the version you want to install, in this case v1.18.5
Importing

import XX as YY
● XX refers to the package you want to import
● YY is the alias you will use to refer to the package
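For example, the conventional aliases for the packages used in this lesson:

import pandas as pd              # data manipulation
import matplotlib.pyplot as plt  # basic charts
import seaborn as sns            # nicer graphs
import plotly.express as px      # interactive charts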
Pandas Functions
pd.read_csv()

Reading CSV file

● Data is often stored in the form of CSV
● Allows us to read the CSV and put it into a tabular format, stored as a DataFrame
● Link to CSV here
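A minimal sketch; the filename is a placeholder, not the actual CSV linked above:

import pandas as pd

# Read the CSV into a DataFrame (a tabular data structure)
df = pd.read_csv("shopping.csv")  # hypothetical filename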
df.head() & df.tail()

Reading top & bottom rows

● Allows you to do a sanity check on the first few or last few rows of your dataframe
● Can specify the number of rows you want; the default is 5 rows
● e.g. df.head(10) for the first 10 rows
● e.g. df.tail(10) for the last 10 rows
df.info()

df.info()
● Gives us a quick summary of all the columns
● How do we deal with Null values if any?
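One common way to start answering that question is to count the nulls per column (a sketch, assuming df is the DataFrame loaded earlier):

df.info()                  # dtypes, non-null counts, memory usage
print(df.isnull().sum())   # number of null values in each column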
df.describe()

df.describe()
● Gives us a quick summary of all the numerical columns
Value Counts

df['Gender'].value_counts()
● Gives us the count of each category in a categorical variable
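A minimal sketch, assuming the Gender column from this lesson's dataset:

print(df['Gender'].value_counts())                 # count of each category
print(df['Gender'].value_counts(normalize=True))   # proportions instead of counts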
Unique Values

df['Season'].unique()
● Obtain all the unique values in a column
Dropping Columns

df.drop(['Gender'], axis = 1, inplace = True)


● Axis = 0 represents Row, Axis = 1 represents Column
● inplace = True to mutate (and save) the data
Dropping NA Values

df.dropna()
● Remove all rows with null or NaN (Not a Number) values
● Any row with a null/NaN value in any column will be dropped

Before:

Customer ID | Age | Payment Method
1           | 55  | Venmo
2           | NaN | Cash
3           | 50  | NaN
4           | 21  | PayPal

After df.dropna():

Customer ID | Age | Payment Method
1           | 55  | Venmo
4           | 21  | PayPal
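A minimal sketch reproducing the example above (df_simple matches the fillna slide that follows):

import pandas as pd
import numpy as np

df_simple = pd.DataFrame({
    'Customer ID': [1, 2, 3, 4],
    'Age': [55, np.nan, 50, 21],
    'Payment Method': ['Venmo', 'Cash', np.nan, 'PayPal'],
})
# Rows 2 and 3 each contain a NaN, so both are dropped
df_clean = df_simple.dropna()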
Filling NA Values

df.fillna()
● Fills all null values with a specified value
● e.g., df_filled = df_simple.fillna({'Age': 0, 'Payment Method': 'Unknown'})

Before:

Customer ID | Age | Payment Method
1           | 55  | Venmo
2           | NaN | Cash
3           | 50  | NaN
4           | 21  | PayPal

After fillna:

Customer ID | Age | Payment Method
1           | 55  | Venmo
2           | 0   | Cash
3           | 50  | Unknown
4           | 21  | PayPal
Sorting Values

df['Season'].value_counts().sort_values()
● Sorts the values of a column
● Smallest to Largest by default
● Use ascending=False to sort largest first
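For example:

# Categories from most to least frequent
df['Season'].value_counts().sort_values(ascending=False)

Note that value_counts() already returns counts in descending order; the explicit sort just makes the intent clear.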
Datetime Manipulation

df['year_month'] = pd.to_datetime(df['order_date'])
● Converts a column (order_date) to datetime format
● Required for time-series analysis
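A minimal sketch; deriving an actual year-month value with .dt.to_period('M') is an assumption about the intent behind the year_month name, since pd.to_datetime alone only parses the dates:

df['order_date'] = pd.to_datetime(df['order_date'])     # parse strings to datetimes
df['year_month'] = df['order_date'].dt.to_period('M')   # e.g. 2023-07 (assumed intent)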
Charts & Plots
Types of Charts

Categorical Variables:
- Bar charts
- Pie charts
- Stacked bar charts
- Radar charts

Numerical Data:
- Scatter plots
- Line charts
- Histograms
- Box plots

Time-Series Data:
- Time-series line charts
- Area charts
- Gantt charts

Geospatial Data:
- Choropleth map
- Heatmap

Others:
- Correlation matrix
- Waterfall chart
- Many more!
Pie Charts

plt.pie(gender_counts, labels=gender_counts.index, colors=colors, autopct='%1.1f%%')
● Useful when there are only a few categories
● Very intuitive
● Used to quickly check for class imbalances / category imbalances
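A runnable sketch; gender_counts and colors are built here for illustration and are not defined on the slide:

import matplotlib.pyplot as plt

gender_counts = df['Gender'].value_counts()
colors = ['#66b3ff', '#ff9999']   # illustrative, one colour per category
plt.pie(gender_counts, labels=gender_counts.index, colors=colors, autopct='%1.1f%%')
plt.title('Gender Split')
plt.show()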
Box Plots
sns.boxplot(x=col, y=counts)
● Quick overview of data distribution of all numerical features
● Can easily identify outliers
● Question: What do we do if we see any extreme outliers here?
[Diagram: box plot anatomy, showing Q1, median, Q3, the range (whiskers), and outliers]
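A minimal sketch for a single numerical column (the slide's x=col, y=counts arguments depend on variables defined elsewhere in the notebook):

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['Purchase Amount (USD)'])   # column name taken from the bar plot slide
plt.show()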
Bar Plots

sns.barplot(data=df, x='Category', y='Purchase Amount (USD)', errorbar=None, palette="pastel")
● Can accommodate more categories
● Very intuitive; can add in a lot of other details, e.g., counts
● Used to check the distribution of categories
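The call above is complete as written; a runnable wrapper might look like this. errorbar=None (which replaced the older ci=None in seaborn 0.12) suppresses the confidence-interval bars:

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(data=df, x='Category', y='Purchase Amount (USD)', errorbar=None, palette='pastel')
plt.xticks(rotation=45)   # keep long category labels readable
plt.show()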
Correlation Matrix Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, linewidths=0.5, square = True)
● Looks at correlation between all numerical values
● Extremely important to understand collinearity
● Useful to determine whether we need to remove highly correlated variables
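A minimal sketch; correlation_matrix and cmap are built here for illustration:

import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.select_dtypes('number').corr()   # numeric columns only
cmap = sns.diverging_palette(230, 20, as_cmap=True)      # illustrative palette
sns.heatmap(correlation_matrix, annot=True, cmap=cmap, linewidths=0.5, square=True)
plt.show()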
Density Plot
sns.kdeplot(age_data, fill=True, legend=False)
● Look at the density distribution of continuous variables
● Useful for checking if the distribution is normal
● Easy to look for extreme outliers (e.g., high density of age > 80)
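A minimal sketch, assuming an Age column; age_data is built here for illustration:

import seaborn as sns
import matplotlib.pyplot as plt

age_data = df['Age'].dropna()
sns.kdeplot(age_data, fill=True, legend=False)
plt.show()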
Choropleth Map
merged_data.plot(column='count', cmap='coolwarm', linewidth=0.8, ax=ax, edgecolor='0.8')
● Easily the coolest map (+10 style points)
● I added this map for fun
● Very useful in showing data distribution for geographical data
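A sketch of how merged_data might be assembled with geopandas; the shapefile path, the Location column, and the NAME join key are all placeholders, not from the lesson:

import geopandas as gpd
import matplotlib.pyplot as plt

world = gpd.read_file('countries.shp')                                # placeholder path
counts = df['Location'].value_counts().rename('count')                # placeholder column
merged_data = world.merge(counts, left_on='NAME', right_index=True)   # placeholder key

fig, ax = plt.subplots(figsize=(12, 6))
merged_data.plot(column='count', cmap='coolwarm', linewidth=0.8, ax=ax, edgecolor='0.8')
plt.show()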
Data Cleaning
Dealing with NA Values

df.dropna()

● Easiest method, just remove unclean data!


● What if NA values make up >20% of the data?

df[column].fillna()
● Fill NA values with something else
● Mean? Median?
● What about for categorical values?

What if data is not missing at random?


Dealing with NA Values

● In this example, all females’ Group is unknown


○ Data is not missing at random
○ Fill with ‘Unknown’

● What about Salary?


○ Impute with mean/median
○ Fill with 0

Gender | Group | Salary
Male   | A     | 50
Female | ?     | 100
Male   | A     | ?
Female | ?     | 50
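A sketch of both strategies, using the column names from the table above (median is one reasonable choice, not the slide's prescription):

df['Group'] = df['Group'].fillna('Unknown')                 # not missing at random
df['Salary'] = df['Salary'].fillna(df['Salary'].median())   # impute a central value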
Dealing with Extreme Values

Extreme outliers can skew models!


Some outliers may even be due to human error
Some data inherently will be extremely skewed: Salary, Age
df.drop()

● Also the easiest method: just remove the extreme outliers!

● What if the data is not normally distributed?

What if extreme data points tell us something important?


E.g., using salary to predict spending
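One common way to identify and drop extreme outliers is the 1.5 * IQR rule; this is an illustrative technique the slide does not name, applied here to the Salary example:

q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
# Keep only rows within 1.5 * IQR of the quartiles
mask = df['Salary'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[mask]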
Binning Numerical Data

Changing numerical features to categorical features


● Create groups for extreme values
○ <20: Age Group 0-20
○ >60: Age Group 61+
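A sketch with pd.cut; the middle bin edge at 40 is an assumption, only the <20 and >60 cut-offs come from the slide:

import pandas as pd

bins = [0, 20, 40, 60, float('inf')]   # 40 is an assumed middle edge
labels = ['0-20', '21-40', '41-60', '61+']
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)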
Learning from Documentation

SEABORN API: https://seaborn.pydata.org/api.html

● Built-in examples of how you can use their plot methods
● Detailed explanation of customisations/parameters
Exercise
