Python For Exploratory Data Analysis (Workshop)
Proposal:
Exploratory Data Analysis (EDA) is about getting an overall understanding of data.
EDA includes exploring data to find its main characteristics, identifying patterns, and
visualizing them. EDA provides meaningful insights into data that can be used in a variety of
applications, e.g., machine learning. Python is well suited to EDA because it
has a rich set of easy-to-use libraries such as Pandas, Seaborn, NumPy and Matplotlib.
In this workshop we will cover the basics of EDA using a real-world data set, including,
but not limited to, Correlating, Converting, Completing, Correcting, Creating and
Charting the data. In addition, we will learn how to install and use Jupyter Notebooks
(an open-source web application that allows you to create and share documents that
contain live code, equations, visualizations and narrative text).
Setting up Requirements:
The first step is to understand and install all requirements. This also includes acquiring the data (on
which EDA is going to be done) from the GitHub link given below.
The following steps will be completed on every attendee's machine.
● Make sure Python is installed and working (Python 2)
● A brief introduction to Python virtual environments
○ A virtual environment is a self-contained directory tree that contains a
Python installation for a particular version of Python, plus a number of
additional packages.
● Create a virtual environment
● A brief introduction to Jupyter notebooks
○ https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html
● Install Jupyter Notebook
○ http://jupyter.org/install.html
● Get the data and requirement file from
https://github.com/noraiz-anwar/exploratory-data-analysis
● Install all requirements using pip from the given requirement file
● Check that all requirements are satisfied
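The last step above can be sketched in a few lines of Python. This is only a minimal illustration: the list of module names is taken from the libraries mentioned in this proposal, the actual requirement file may list more, and the helper name check_requirements is hypothetical.

```python
import importlib

# Libraries named in this proposal; the real requirement file may differ.
REQUIRED = ["pandas", "numpy", "seaborn", "matplotlib"]


def check_requirements(names):
    """Map each module name to its version string, or to None if it
    cannot be imported (hypothetical helper, for illustration only)."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            status[name] = None
        else:
            # Not every module exposes __version__, so fall back gracefully.
            status[name] = getattr(module, "__version__", "unknown")
    return status


if __name__ == "__main__":
    for name, version in sorted(check_requirements(REQUIRED).items()):
        print("{:<12} {}".format(name, version or "MISSING"))
```

Any library reported as MISSING should be installed inside the virtual environment before continuing.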
A brief introduction of installed libraries:
The file athlete_events.csv contains 271116 rows and 15 columns. Each row
corresponds to an individual athlete competing in an individual Olympic event
(athlete-events). Columns are the following:
The file noc_regions.csv contains 230 rows and 3 columns. Each row contains a
NOC, its related region, and any notes. Columns are the following:
1. NOC - National Olympic Committee 3-letter code;
2. Region - Name of the country;
3. Notes - String containing any useful information about the region and NOC.
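As a rough sketch of how the two files might be loaded and combined, the example below reads two tiny made-up CSV snippets and joins them on the NOC code. The sample rows and the athlete column names (ID, Name, Medal) are assumptions for illustration, and joining on NOC is one plausible use of the region file, not necessarily the step used in the workshop.

```python
import io

import pandas as pd

# Made-up stand-ins for the two CSV files so that the sketch runs on its own.
# In the workshop, pd.read_csv("athlete_events.csv") would be used instead.
athletes_csv = io.StringIO(
    "ID,Name,NOC,Medal\n"
    "1,Athlete A,AAA,Gold\n"
    "2,Athlete B,BBB,\n"
    "3,Athlete C,ZZZ,Bronze\n"
)
regions_csv = io.StringIO(
    "NOC,Region,Notes\n"
    "AAA,Country A,\n"
    "BBB,Country B,Some note\n"
)

athletes = pd.read_csv(athletes_csv)
regions = pd.read_csv(regions_csv)

# A left join keeps every athlete row, even if its NOC has no matching region.
merged = athletes.merge(regions, on="NOC", how="left")

print(athletes.shape)  # (number of rows, number of columns)
print(merged[["Name", "NOC", "Region"]])
```

Rows whose NOC code is absent from the region file end up with a null Region, which is worth checking before any per-country analysis.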
● We want to find out whether any columns contain null values. Check using pandas'
isnull.
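A minimal sketch of the null check, run on a small made-up DataFrame (the column names and values are assumptions for illustration, not rows from the real data set):

```python
import numpy as np
import pandas as pd

# Made-up sample with some missing values.
df = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "Age": [24.0, np.nan, 31.0],
    "Medal": ["Gold", None, None],
})

# isnull() returns a boolean frame; summing it counts the nulls per column.
null_counts = df.isnull().sum()
print(null_counts)

# Names of the columns that contain at least one null value.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```

The same two lines work unchanged on the full DataFrame once the CSV file has been loaded.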
Querying Data:
Run different queries on the data to extract further knowledge from it. We will discuss the
following important concepts and techniques.
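The kinds of queries meant here might look like the sketch below, which shows three common pandas techniques: boolean filtering, grouping, and counting. The DataFrame is made up and the column names are assumptions; the specific queries covered in the workshop may differ.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["F", "M", "F", "M"],
    "Age": [22, 30, 27, 35],
    "Medal": ["Gold", None, "Silver", "Gold"],
})

# Filtering with a boolean mask: athletes who won a gold medal.
gold = df[df["Medal"] == "Gold"]

# Grouping: mean age for each sex.
mean_age_by_sex = df.groupby("Sex")["Age"].mean()

# Counting: number of medals of each kind (nulls are skipped by default).
medal_counts = df["Medal"].value_counts()

print(gold)
print(mean_age_by_sex)
print(medal_counts)
```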
Countplot examples:
Pointplot examples:
Barplot examples:
Boxplot examples:
Scatterplot example:
Height and weight ratio of athletes
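A scatterplot of this kind could be drawn with Seaborn roughly as follows. The sample values are made up, and the column names Height, Weight and Sex as well as the units (cm, kg) are assumptions about the data set.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample data for illustration.
df = pd.DataFrame({
    "Height": [160, 170, 175, 182, 190],
    "Weight": [52, 64, 70, 80, 91],
    "Sex": ["F", "F", "M", "M", "M"],
})

fig, ax = plt.subplots(figsize=(6, 4))
# One point per athlete, coloured by sex.
sns.scatterplot(data=df, x="Height", y="Weight", hue="Sex", ax=ax)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Height vs. weight of athletes")
```

On the real data, rows with a missing height or weight are simply not drawn, so the plot only reflects athletes for whom both values are recorded.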
Heatmap example:
1. Average age of medal winners in the Olympic Games.
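One way to build such a heatmap is to pivot the medal winners into a table of average ages and pass it to Seaborn. The sketch below uses made-up rows; the column names Year, Medal and Age, and the choice of year against medal type as the two axes, are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample data; the last row is an athlete without a medal.
df = pd.DataFrame({
    "Year": [2008, 2008, 2012, 2012, 2012, 2008],
    "Medal": ["Gold", "Silver", "Gold", "Silver", "Gold", None],
    "Age": [24, 28, 22, 30, 26, 35],
})

# Keep medal winners only, then average their age per year and medal type.
winners = df.dropna(subset=["Medal"])
avg_age = winners.pivot_table(
    index="Year", columns="Medal", values="Age", aggfunc="mean"
)

fig, ax = plt.subplots(figsize=(5, 3))
# annot=True writes the average age inside each cell.
sns.heatmap(avg_age, annot=True, fmt=".1f", cmap="viridis", ax=ax)
ax.set_title("Average age of medal winners")
```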
In addition, we will discuss and analyse trends and patterns while visualizing
the data.
Only some examples are given here. We may draw additional graphs as we
learn more about the data.
References:
● Data is taken from
https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
● This work is inspired by my fellow learners at Kaggle:
○ https://www.kaggle.com/marcogdepinto/let-s-discover-more-about-the-olympic-games
○ https://www.kaggle.com/arunsankar/key-insights-from-olympic-history-data
○ And by other Kaggle work and the great documentation of the Python libraries.