Python For Exploratory Data Analysis

Cheat Sheet PDA

Uploaded by

Muhammad Faizan

Python for Exploratory Data Analysis (Workshop)

Proposal:
Exploratory Data Analysis (EDA) is about getting an overall understanding of data.
EDA involves exploring a data set to find its main characteristics, identifying patterns
and visualizing them. EDA provides meaningful insights into data that can be used in a
variety of applications, e.g. machine learning. Python is well suited to EDA because it
has a rich set of easy-to-use libraries such as Pandas, Seaborn, NumPy and Matplotlib.
In this workshop we will cover the basics of EDA using a real-world data set, including,
but not limited to, Correlating, Converting, Completing, Correcting, Creating and
Charting the data. In addition, we will learn how to install and use Jupyter Notebooks
(an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text).

Setting up Requirements:
The first step is to understand and install all requirements. This also includes acquiring
the data (on which the EDA is going to be done) from a given GitHub link.
The following steps will be completed on every attendee's machine.
● Make sure Python is installed and working (Python 2)
● A brief introduction to Python virtual environments
○ A virtual environment is a self-contained directory tree that contains a
Python installation for a particular version of Python, plus a number of
additional packages.
● Create a virtual environment
● A brief introduction to Jupyter notebooks
○ https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html
● Install Jupyter Notebook
○ http://jupyter.org/install.html
● Get the data and the requirements file from
https://github.com/noraiz-anwar/exploratory-data-analysis
● Install all requirements using pip from the given requirements file
● Check that all requirements are satisfied
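
The last step above can be done in several ways. One possible sketch is a small script that tries to import each library and reports its version. The list of names is an assumption (the four libraries named in this proposal, not the contents of the actual requirements file), and check_requirements is a hypothetical helper written for this illustration.

```python
import importlib

# Assumption: the four libraries named in this proposal. The real
# requirements file in the repository may list more (or pin versions).
REQUIRED = ["numpy", "pandas", "seaborn", "matplotlib"]


def check_requirements(names):
    """Return a dict mapping each module name to its version, or None if missing."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Fall back to "unknown" for modules that expose no __version__
            found[name] = getattr(module, "__version__", "unknown")
    return found


status = check_requirements(REQUIRED)
for name, version in sorted(status.items()):
    print(name, "->", version if version is not None else "MISSING")
```

Any library reported as MISSING can then be installed with pip inside the virtual environment.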
A brief introduction of installed libraries:
We will be using the installed libraries to perform different operations on data. Let’s
explore these libraries a bit.
● NumPy
○ NumPy is a library for the Python programming language, adding
support for large, multi-dimensional arrays and matrices, along with a
large collection of high-level mathematical functions to operate on
these arrays.
○ http://www.numpy.org/
● Pandas
○ pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with “relational” or “labeled”
data both easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and
flexible open source data analysis / manipulation tool available in any
language. It is already well on its way toward this goal.
○ https://pandas.pydata.org/
● Seaborn
○ Seaborn is a Python data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
○ https://seaborn.pydata.org/
● Matplotlib
○ Matplotlib is a Python 2D plotting library which produces publication
quality figures in a variety of hardcopy formats and interactive
environments across platforms.
○ https://matplotlib.org/
Introduction of data:
We will be using data from the Olympic Games. This data set holds 120 years of Olympic
history, including biographical details of athletes and information about the Games they
participated in.

The file athlete_events.csv contains 271116 rows and 15 columns. Each row
corresponds to an individual athlete competing in an individual Olympic event
(athlete-events). The columns are the following:

1. ID - Unique number for each athlete;
2. Name - Athlete's name;
3. Sex - M or F;
4. Age - Integer;
5. Height - In centimeters;
6. Weight - In kilograms;
7. Team - Team name;
8. NOC - National Olympic Committee 3-letter code;
9. Games - Year and season;
10. Year - Integer;
11. Season - Summer or Winter;
12. City - Host city;
13. Sport - Sport;
14. Event - Event;
15. Medal - Gold, Silver, Bronze, or NA.

The file noc_regions.csv contains 230 rows and 3 columns. Each row contains a
NOC, its related region and any notes. The columns are the following:
1. NOC - National Olympic Committee 3-letter code;
2. Region - Name of the country;
3. Notes - String containing any useful information about the region and NOC.

Importing Data into Data Frames:

To start working with the data, we first need to import it from the CSV files into pandas
DataFrames. This is done using pandas’ read_csv function. We will also learn how this
function handles different delimiters.
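
As a minimal sketch of the import step, the example below reads a few made-up rows that mimic the layout of athlete_events.csv. The sample rows and variable names are assumptions for illustration only; in the workshop the real file path would be passed to read_csv instead. It also shows the sep parameter, which tells read_csv which delimiter the file uses.

```python
import io

import pandas as pd

# Made-up sample rows that mimic the layout of athlete_events.csv.
# With the real data you would call pd.read_csv("athlete_events.csv").
comma_text = io.StringIO(
    "ID,Name,Sex,Age,Medal\n"
    "1,Athlete A,M,24,NA\n"
    "2,Athlete B,F,21,Gold\n"
)
athletes = pd.read_csv(comma_text)  # comma is the default delimiter

# The same rows separated by semicolons need an explicit delimiter.
semicolon_text = io.StringIO(
    "ID;Name;Sex;Age;Medal\n"
    "1;Athlete A;M;24;NA\n"
    "2;Athlete B;F;21;Gold\n"
)
athletes_semicolon = pd.read_csv(semicolon_text, sep=";")

print(athletes.shape)
print(athletes)
```

Note that read_csv treats the string NA as a missing value by default, so the Medal column of the first sample row is read as NaN.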
Collecting basic information about data:
We need to get a sense of what our data looks like. We will explore some more pandas
functions here, such as:
● See the data in tabular form using head.
● Descriptive statistics using pandas’ describe.
● Overall summary of the DataFrame.
● Find out whether there are any null values in the columns, using pandas’ isnull.
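
The four inspection steps above might look as follows on a tiny made-up sample. The rows are invented for illustration, and using info for the "overall summary" is an assumption about which function the workshop has in mind.

```python
import io

import pandas as pd

# Made-up rows; the real DataFrame would come from athlete_events.csv.
sample = io.StringIO(
    "ID,Name,Sex,Age,Height,Weight,Medal\n"
    "1,Athlete A,M,24,180,80,NA\n"
    "2,Athlete B,F,21,165,NA,Gold\n"
    "3,Athlete C,F,NA,170,60,Silver\n"
)
df = pd.read_csv(sample)

print(df.head(2))        # first rows in tabular form
print(df.describe())     # count, mean, std, min, quartiles, max of numeric columns
df.info()                # column names, non-null counts and dtypes

# isnull gives a True/False table; summing it counts missing values per column
null_counts = df.isnull().sum()
print(null_counts)
```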
Querying Data:
We will run different queries on the data to extract further knowledge from it. We will
discuss the following important concepts and techniques.

Understanding Boolean Indexing:

Boolean indexing is used to perform general queries on a given pandas DataFrame. This is
an important concept to grasp. We will perform different operations on the data to
understand it, e.g.
● Count/find the records with no medal recorded.
● Count/find the youngest and oldest athletes who won a Gold medal.
● Count/find the number of Gold medals won by women of a specific country in a
particular year.
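
A small sketch of these three queries is shown below. The rows, the country code CHN and the year 2008 are made-up choices for illustration. Each condition produces a Series of True/False values, and indexing the DataFrame with it keeps only the rows marked True.

```python
import numpy as np
import pandas as pd

# Made-up rows using a few of the columns of athlete_events.csv.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["F", "M", "F", "F", "M"],
    "Age": [19, 31, 25, 22, 28],
    "NOC": ["CHN", "USA", "CHN", "CHN", "CHN"],
    "Year": [2008, 2008, 2008, 2012, 2008],
    "Medal": ["Gold", np.nan, "Gold", "Gold", "Silver"],
})

# Records without any medal
no_medal = df[df["Medal"].isnull()]

# Youngest and oldest Gold medallists
gold = df[df["Medal"] == "Gold"]
youngest = gold[gold["Age"] == gold["Age"].min()]
oldest = gold[gold["Age"] == gold["Age"].max()]

# Gold medals won by women of one country in one year.
# Conditions are combined with & and each one needs its own parentheses.
mask = (
    (df["Medal"] == "Gold")
    & (df["Sex"] == "F")
    & (df["NOC"] == "CHN")
    & (df["Year"] == 2008)
)
women_gold_count = mask.sum()  # True counts as 1

print(len(no_medal), youngest["Name"].tolist(), oldest["Name"].tolist(), women_gold_count)
```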

Explore some builtin functions:

We will explore some important pandas functions by using them, e.g.
● notnull
● loc
● groupby
● value_counts
● pivot_table
● reindex
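
To give a feel for these functions, here is one possible use of each on a made-up DataFrame. The data and the particular questions asked are assumptions for illustration, not the workshop's own exercises.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "NOC": ["CHN", "USA", "CHN", "USA", "CHN", "USA"],
    "Year": [2008, 2008, 2012, 2012, 2012, 2008],
    "Age": [20, 30, 24, 26, 22, 28],
    "Medal": ["Gold", np.nan, "Silver", "Gold", "Gold", np.nan],
})

# notnull: keep only rows where a medal is recorded
medalists = df[df["Medal"].notnull()]

# loc: select rows by condition and columns by label in one step
chn_ages = df.loc[df["NOC"] == "CHN", ["Year", "Age"]]

# groupby: average age per NOC
mean_age_by_noc = df.groupby("NOC")["Age"].mean()

# value_counts: how often each medal type occurs
medal_counts = medalists["Medal"].value_counts()

# pivot_table: medal counts with years as rows and medal types as columns
medals_by_year = medalists.pivot_table(
    index="Year", columns="Medal", values="Age", aggfunc="count", fill_value=0
)

# reindex: put the counts in a fixed order and add absent labels
ordered = medal_counts.reindex(["Gold", "Silver", "Bronze"], fill_value=0)

print(mean_age_by_noc)
print(medals_by_year)
print(ordered)
```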

Cleaning and Completing Data:

At this point we know our data well. We know that it has some missing values, and we
will perform different operations to handle them, e.g.
● Exclude all records for which we have no information about medals.
● Fill missing age values with the average age of the other athletes.
● Fill missing height values for women and men with the average height of women and
men athletes respectively.
● Fill missing weight values for women and men with the average weight of women and
men athletes participating in the same sport.
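
The four cleaning steps above can be sketched as follows. The rows are made up, and filling per group with groupby plus transform is one common approach, not necessarily the one used in the workshop.

```python
import numpy as np
import pandas as pd

# Made-up rows with some missing values.
df = pd.DataFrame({
    "Sex": ["F", "F", "M", "M", "F", "M"],
    "Sport": ["Judo", "Judo", "Judo", "Judo", "Rowing", "Rowing"],
    "Age": [20.0, np.nan, 30.0, 25.0, 22.0, 27.0],
    "Height": [160.0, np.nan, 180.0, np.nan, 170.0, 190.0],
    "Weight": [55.0, np.nan, 80.0, np.nan, 65.0, 90.0],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", np.nan, "Silver"],
})

# 1. Exclude records without medal information
medalists = df.dropna(subset=["Medal"]).copy()

# 2. Fill missing ages with the overall average age
medalists["Age"] = medalists["Age"].fillna(medalists["Age"].mean())

# 3. Fill missing heights with the average height of the same sex.
#    transform("mean") returns one value per row, aligned with the index.
height_by_sex = medalists.groupby("Sex")["Height"].transform("mean")
medalists["Height"] = medalists["Height"].fillna(height_by_sex)

# 4. Fill missing weights with the average weight of the same sex and sport
weight_by_group = medalists.groupby(["Sex", "Sport"])["Weight"].transform("mean")
medalists["Weight"] = medalists["Weight"].fillna(weight_by_group)

print(medalists)
```

If a group has no known values at all, its mean is NaN and the gap stays unfilled, so it is worth re-checking with isnull afterwards.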
Data Visualization:
Visualizing data in different types of graphs will give us greater insight into our data.
We will explore different options for visualizing our data and look for patterns within it.
From now on we will build on our knowledge of the pandas library and learn new
concepts from seaborn and matplotlib.

Countplot examples:
1. Gold medals in gymnastics by age
2. Medals won by China over the years
3. Gold medals won by China in the Summer Olympics, by sport
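
As a rough sketch of the first countplot, the example below counts made-up Gold medallists in gymnastics by age with seaborn's countplot. The ages are invented; with the real data the DataFrame would first be filtered with Boolean indexing on Sport and Medal.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up ages of Gold medallists in gymnastics.
gold_gymnasts = pd.DataFrame({
    "Age": [16, 17, 17, 19, 19, 19, 22],
    "Sport": ["Gymnastics"] * 7,
    "Medal": ["Gold"] * 7,
})

fig, ax = plt.subplots(figsize=(6, 4))
# countplot draws one bar per distinct Age, its height being the number of rows
sns.countplot(data=gold_gymnasts, x="Age", ax=ax)
ax.set_title("Gold medals in gymnastics by age")
ax.set_ylabel("Number of Gold medals")
fig.tight_layout()
```

In a Jupyter notebook the figure is displayed below the cell; in a script it can be written to disk with fig.savefig.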

Pointplot examples:
1. Height of male athletes over the years
2. Height of female athletes over the years

Barplot examples:
1. Top 5 countries with the most medals
2. Number of athletes in each Olympic Games

Boxplot examples:
1. Age distribution of male and female athletes in the Olympic Games
2. Variation of age for female athletes over time

Scatterplot example:
1. Height versus weight of athletes

Heatmap example:
1. Average age of medal winners in the Olympic Games
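
A heatmap needs its input as a two-dimensional table, so a natural sketch combines pivot_table with seaborn's heatmap. The rows below are made up, and the choice of medal type as rows and year as columns is an assumption about how the chart could be laid out.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up medal winners.
winners = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2012, 2012, 2012],
    "Medal": ["Gold", "Gold", "Silver", "Gold", "Silver", "Silver"],
    "Age": [20, 24, 30, 26, 21, 23],
})

# Average age for every (medal, year) combination as a 2D table
avg_age = winners.pivot_table(index="Medal", columns="Year", values="Age", aggfunc="mean")

fig, ax = plt.subplots(figsize=(5, 3))
# annot=True writes the value into each cell; fmt controls its formatting
sns.heatmap(avg_age, annot=True, fmt=".1f", cmap="viridis", ax=ax)
ax.set_title("Average age of medal winners")
fig.tight_layout()

print(avg_age)
```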

In addition, we will discuss and analyse trends and patterns while visualizing the data.
These are only some examples. We may draw additional graphs as we learn more about
the data.

References:
● Data is taken from
https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
● This work is inspired by my fellow learners at Kaggle:
○ https://www.kaggle.com/marcogdepinto/let-s-discover-more-about-the-olympic-games
○ https://www.kaggle.com/arunsankar/key-insights-from-olympic-history-data
○ And by other Kaggle work and the great documentation of the Python libraries.
