Python For Exploratory Data Analysis

Cheat Sheet PDA

Uploaded by

Muhammad Faizan
Copyright
© © All Rights Reserved
Python for Exploratory Data Analysis (Workshop)

Proposal:
Exploratory Data Analysis (EDA) is about getting an overall understanding of data.
EDA includes exploring data to find its main characteristics, identifying patterns, and
visualizing the results. EDA provides meaningful insights into data that can be used in a
variety of applications, e.g., machine learning. Python is well suited to EDA because it
has a rich set of easy-to-use libraries such as Pandas, Seaborn, NumPy and Matplotlib.
In this workshop we will cover the basics of EDA using a real-world data set, including,
but not limited to, Correlating, Converting, Completing, Correcting, Creating and
Charting the data. In addition, we will learn how to install and use Jupyter Notebooks
(an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text).

Setting up Requirements:
The first step is to understand and install all requirements. This also includes acquiring the
data (on which the EDA will be done) from a given GitHub link.
The following steps will be completed on every attendee's machine.
● Make sure python is installed and working (Python 2)
● A brief introduction on python virtual environment
○ Virtual environment is a self-contained directory tree that contains a
Python installation for a particular version of Python, plus a number of
additional packages.
● Create a virtual environment
● A brief introduction on jupyter notebooks
○ https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html
● Install Jupyter notebook
○ http://jupyter.org/install.html
● Get data and requirement file from
https://github.com/noraiz-anwar/exploratory-data-analysis
● Install all requirements using pip from given requirement file
● Check all requirements are satisfied
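
The last step above can be sketched in a few lines of Python. This is only a minimal check, assuming the requirement file lists the four libraries named in this proposal; the authoritative list is the requirement file in the repository, and the helper name missing_packages is made up for illustration.

```python
from importlib import util

# Assumed package list, taken from the libraries named in this proposal.
# The real list lives in the requirement file of the workshop repository.
REQUIRED = ["numpy", "pandas", "seaborn", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All listed packages can be imported.")
```

Running this inside the activated virtual environment shows whether the pip install step reached that environment rather than the system Python.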
A brief introduction of installed libraries:

We will be using the installed libraries to perform different operations on data. Let’s
explore these libraries a bit.
● NumPy
○ NumPy is a library for the Python programming language, adding
support for large, multi-dimensional arrays and matrices, along with a
large collection of high-level mathematical functions to operate on
these arrays.
○ http://www.numpy.org/
● Pandas
○ pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with “relational” or “labeled”
data both easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and
flexible open source data analysis / manipulation tool available in any
language. It is already well on its way toward this goal.
○ https://pandas.pydata.org/
● Seaborn
○ Seaborn is a Python data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
○ https://seaborn.pydata.org/
● Matplotlib
○ Matplotlib is a Python 2D plotting library which produces publication
quality figures in a variety of hardcopy formats and interactive
environments across platforms.
○ https://matplotlib.org/
Introduction of data:
We will be using data from the Olympic Games here. This data covers 120 years of Olympic
history, including biographical details of athletes and information about the Games they
participated in.

The file athlete_events.csv contains 271116 rows and 15 columns. Each row
corresponds to an individual athlete competing in an individual Olympic event
(athlete-events). The columns are the following:

1. ID - Unique number for each athlete;
2. Name - Athlete's name;
3. Sex - M or F;
4. Age - Integer;
5. Height - In centimeters;
6. Weight - In kilograms;
7. Team - Team name;
8. NOC - National Olympic Committee 3-letter code;
9. Games - Year and season;
10. Year - Integer;
11. Season - Summer or Winter;
12. City - Host city;
13. Sport - Sport;
14. Event - Event;
15. Medal - Gold, Silver, Bronze, or NA.

The file noc_regions.csv contains 230 rows and 3 columns. Each row contains a
NOC and its related region and any notes. The columns are the following:
1. NOC - National Olympic Committee 3-letter code;
2. Region - Name of country;
3. Notes - String containing any useful information about the region and NOC.

Importing Data into Data Frames:


To start working on the data, we first need to import it from the csv files into pandas
DataFrames. This will be done using pandas’ read_csv function. We will also learn how this
function handles different delimiters.
Collecting basic information about data:
We need to get a sense of what our data looks like. We will explore some more
pandas functions here:
● See the data in tabular form using head.
● Get descriptive statistics using pandas’ describe.
● Get an overall summary of the DataFrame.
● Find out whether there are any null values in the columns using pandas’ isnull.
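
The four steps above might look like the following on a small made-up sample. The proposal does not name the function behind the overall summary, so using info here is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse a few column names from athlete_events.csv.
sample = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
    "Sex": ["M", "F", "F", "M"],
    "Age": [24.0, np.nan, 31.0, 19.0],
    "Medal": ["Gold", np.nan, np.nan, "Bronze"],
})

# First rows in tabular form.
print(sample.head(2))

# Count, mean, std, min, quartiles and max of the numeric columns.
stats = sample.describe()
print(stats)

# Column names, non-null counts and dtypes (assumed summary step).
sample.info()

# isnull marks missing cells; summing the marks counts them per column.
null_counts = sample.isnull().sum()
print(null_counts)
```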
Querying Data:
We will run different queries on the data to extract further knowledge from it. We will
discuss the following important concepts and techniques.

Understanding Boolean Indexing:


Boolean indexing is used to perform general queries on a given pandas DataFrame. This is
an important concept to grasp. We will perform different operations on the data to
understand it, e.g.
● Count/find the records that have no medal recorded.
● Find the youngest and oldest athletes who won a Gold medal.
● Count/find the gold medals won by women of a specific country in a
particular year.
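
A minimal sketch of the three queries above, run on made-up rows. The country code and year used in the last query are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse column names from athlete_events.csv.
sample = pd.DataFrame({
    "Sex": ["F", "M", "F", "F", "M"],
    "Age": [22, 35, 17, 28, 41],
    "NOC": ["USA", "USA", "CHN", "USA", "CHN"],
    "Year": [2016, 2016, 2016, 2012, 2016],
    "Medal": ["Gold", np.nan, "Gold", "Gold", "Silver"],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows.
no_medal = sample[sample["Medal"].isnull()]

# Youngest and oldest Gold medal winners.
gold = sample[sample["Medal"] == "Gold"]
youngest_gold_age = gold["Age"].min()
oldest_gold_age = gold["Age"].max()

# Conditions combine with & (and) and | (or); each needs its own parentheses.
mask = (
    (sample["Medal"] == "Gold")
    & (sample["Sex"] == "F")
    & (sample["NOC"] == "USA")
    & (sample["Year"] == 2016)
)
women_gold_count = int(mask.sum())  # True counts as 1 when summed

print(len(no_medal), youngest_gold_age, oldest_gold_age, women_gold_count)
```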

Explore some builtin functions:


We will explore some important pandas library functions by using them, e.g.
● notnull
● loc
● groupby
● value_counts
● pivot_table
● reindex
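
The sketch below touches each of the listed functions once, again on made-up rows. It is one possible tour of them, not the exact exercises of the workshop.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse column names from athlete_events.csv.
sample = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F"],
    "Age": [20.0, 30.0, 24.0, 26.0, 22.0],
    "Season": ["Summer", "Summer", "Winter", "Winter", "Summer"],
    "Medal": ["Gold", "Silver", np.nan, "Gold", "Bronze"],
})

# notnull + loc: keep rows that have a medal and select three columns.
medalists = sample.loc[sample["Medal"].notnull(), ["Sex", "Age", "Medal"]]

# value_counts: rows per medal type (missing values are skipped by default).
medal_counts = sample["Medal"].value_counts()

# reindex: put the counts in a fixed order instead of sorted by frequency.
medal_counts = medal_counts.reindex(["Gold", "Silver", "Bronze"])

# groupby: mean age per sex.
mean_age_by_sex = sample.groupby("Sex")["Age"].mean()

# pivot_table: mean age with Sex as rows and Season as columns.
age_table = sample.pivot_table(
    values="Age", index="Sex", columns="Season", aggfunc="mean"
)

print(medalists)
print(medal_counts)
print(mean_age_by_sex)
print(age_table)
```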

Cleaning and Completing Data:


At this point we are well aware of our data. We know that it has some missing values. We
will perform different operations on it, e.g.
● Exclude all records from the data where we don’t have any information about medals.
● Fill missing age values with the average age of the other athletes.
● Fill missing height values for women and men with the average height of women and
men athletes respectively.
● Fill missing weight values for women and men with the average weight of women and
men athletes participating in the same sport.
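
One way to carry out the four cleaning steps above is sketched here on made-up rows. Using groupby with transform to compute the per-group averages is an assumption about the approach; the workshop may do it differently.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse column names from athlete_events.csv.
sample = pd.DataFrame({
    "Sex": ["F", "F", "M", "M", "M", "F", "M"],
    "Sport": ["Judo", "Judo", "Judo", "Judo", "Judo", "Rowing", "Rowing"],
    "Age": [20.0, np.nan, 30.0, 26.0, 24.0, 22.0, 28.0],
    "Height": [160.0, np.nan, 180.0, np.nan, 184.0, 170.0, 190.0],
    "Weight": [55.0, np.nan, 80.0, np.nan, 82.0, 70.0, 90.0],
    "Medal": ["Gold", "Silver", np.nan, "Bronze", "Gold", "Gold", "Silver"],
})

# 1. Exclude records without medal information.
medalists = sample.dropna(subset=["Medal"]).copy()

# 2. Fill missing ages with the overall average age.
medalists["Age"] = medalists["Age"].fillna(medalists["Age"].mean())

# 3. Fill missing heights with the average height of the same sex.
#    transform returns one value per row, aligned with the original index.
height_by_sex = medalists.groupby("Sex")["Height"].transform("mean")
medalists["Height"] = medalists["Height"].fillna(height_by_sex)

# 4. Fill missing weights with the average of the same sex and sport.
weight_by_group = medalists.groupby(["Sex", "Sport"])["Weight"].transform("mean")
medalists["Weight"] = medalists["Weight"].fillna(weight_by_group)

print(medalists)
```

If a group has no known value at all, its average is also missing and the gap stays unfilled, so it is worth re-checking with isnull after this step.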
Data Visualization:
Visualizing data in different types of graphs will give us greater insight into our data.
We will explore different options for visualizing our data and look for patterns within it.
From now on we will build on our previous knowledge of the pandas library and try to grasp
new concepts of seaborn and matplotlib.

Countplot examples:

1. Gold medals in gymnastics by age
2. Medals won by China over the years
3. Gold medals won by China in the Summer Olympics by sport
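
The first countplot example could be sketched as follows. The rows are made up, and the Agg backend line is only there so the script runs without a display; inside a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that reuse column names from athlete_events.csv.
sample = pd.DataFrame({
    "Sport": ["Gymnastics"] * 5 + ["Judo"],
    "Age": [16, 16, 19, 22, 19, 25],
    "Medal": ["Gold", "Gold", "Gold", "Silver", "Gold", "Gold"],
})

# Boolean indexing narrows the data to gymnastics gold medals.
gym_gold = sample[(sample["Sport"] == "Gymnastics") & (sample["Medal"] == "Gold")]

# countplot draws one bar per distinct Age, with height equal to the row count.
fig, ax = plt.subplots(figsize=(6, 3))
sns.countplot(data=gym_gold, x="Age", ax=ax)
ax.set_title("Gold medals in gymnastics by age")
plt.close(fig)
```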

Pointplot examples:

1. Height of male athletes over the years
2. Height of female athletes over the years

Barplot examples:

1. Top 5 countries with the most medals
2. Number of athletes in each Olympic Games
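
A possible sketch of the first barplot example, on made-up rows. It counts medal rows per NOC code, so with the real data a team medal would be counted once per athlete, since each row is one athlete in one event.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up rows: the first row has no medal, all the others do.
nocs = ["USA"] * 7 + ["CHN"] * 5 + ["GER"] * 4 + ["FRA"] * 3 + ["JPN"] * 2 + ["ITA"]
sample = pd.DataFrame({"NOC": nocs, "Medal": [np.nan] + ["Gold"] * 21})

# Count medal rows per NOC, largest first, and keep the top five.
medal_counts = sample.dropna(subset=["Medal"])["NOC"].value_counts().head(5)
top = medal_counts.rename_axis("NOC").reset_index(name="Medals")

fig, ax = plt.subplots(figsize=(6, 3))
sns.barplot(data=top, x="NOC", y="Medals", ax=ax)
ax.set_title("Top 5 NOCs by medal rows")
plt.close(fig)

print(top)
```

To show country names instead of codes, the counts could be merged with noc_regions.csv on the NOC column.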

Boxplot examples:

1. Age distribution of male/female athletes in the Olympic Games
2. Variation of age for female athletes over time

Scatterplot example:
Height versus weight of athletes

Heatmap example:
1. Average age of medal winners in the Olympic Games
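
The heatmap example could be built from a pivot table, as in this sketch on made-up rows. Choosing Games as rows and Medal as columns is an assumption; the proposal does not say how the averages are broken down.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up rows that reuse column names from athlete_events.csv.
sample = pd.DataFrame({
    "Games": ["2012 Summer"] * 4 + ["2016 Summer"] * 4,
    "Medal": ["Gold", "Gold", "Silver", np.nan,
              "Gold", "Silver", "Silver", "Bronze"],
    "Age": [20, 24, 30, 40, 26, 22, 28, 25],
})

# Average age of medal winners per Games (rows) and medal type (columns).
winners = sample.dropna(subset=["Medal"])
age_table = winners.pivot_table(
    values="Age", index="Games", columns="Medal", aggfunc="mean"
)

# Each cell is coloured by its value; cells without data are left blank.
fig, ax = plt.subplots(figsize=(5, 3))
sns.heatmap(age_table, annot=True, fmt=".1f", ax=ax)
ax.set_title("Average age of medal winners")
plt.close(fig)

print(age_table)
```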

In addition, we will discuss and analyse trends and patterns while visualizing
the data.
These are only some examples. We may draw additional graphs as we
continue to learn more about the data.

References:
● Data is taken from
https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
● This work is inspired by my fellow learners at Kaggle:
○ https://www.kaggle.com/marcogdepinto/let-s-discover-more-about-the-olympic-games
○ https://www.kaggle.com/arunsankar/key-insights-from-olympic-history-data
○ And by other Kaggle users and the great documentation of the Python libraries.
