Python For Data Analysis Jan 28
Python For Data Analysis Jan 28
Presented by
Ambati Rami Reddy
Ph.D. Scholar (PMRF)
Department of CSE
07/29/2025 1
Overview of Python Libraries for Data
Tutorial Content Scientists
Descriptive statistics
Inferential statistics
2
Python Libraries for Data Science
Many popular Python toolboxes/libraries:
• NumPy
• SciPy All these libraries are
• Pandas installed on the SCC
• SciKit-Learn
Visualization libraries
• matplotlib
• Seaborn
Link: http://www.numpy.org/
4
Python Libraries for Data Science
SciPy:
collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
built on NumPy
Link: https://www.scipy.org/scipylib/
5
Python Libraries for Data Science
Pandas:
adds data structures and tools designed to work with table-like data (similar
to Series and Data Frames in R)
Link: http://pandas.pydata.org/
6
Python Libraries for Data Science
SciKit-Learn:
provides machine learning algorithms: classification, regression, clustering,
model validation etc.
Link: http://scikit-learn.org/
7
Python Libraries for Data Science
matplotlib:
python 2D plotting library which produces publication quality figures in a
variety of hardcopy formats
Link: https://matplotlib.org/
8
Python Libraries for Data Science
Seaborn:
based on matplotlib
Link: https://seaborn.pydata.org/
9
Loading Python Libraries
In [ ]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
10
Data Frame data types
Pandas Type Native Python Type Description
object string The most general dtype. Will be
assigned to your column if column
has mixed types (numbers and
strings).
datetime64, timedelta[ns] N/A (but see the datetime module Values meant to hold time data.
in Python’s standard library) Look into these for time series
experiments.
11
Data Frames attributes
Python objects have attributes and methods.
df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
12
Hands-on exercises
13
Data Frames methods
Unlike attributes, python methods have parenthesis.
All attributes and methods can be listed with a dir() function: dir(df)
df.method() description
head( [n] ), tail( [n] ) first/last n rows
What are the mean values of the first 50 records in the dataset? Hint: use
head() method to subset the first 50 records and then calculate the mean
15
Selecting a column in a Data Frame
Method 1: Subset the data frame using column name:
df['sex']
Note: there is an attribute rank for pandas data frames, so to select a column with a name
"rank" we should use method 1.
16
Data Frames groupby method
Using "group by" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group
• Similar to dplyr() function in R
In [ ]: #Group data using rank
df_rank = df.groupby(['rank'])
In [ ]: #Calculate mean value for each numeric column per each group
df_rank.mean()
17
Data Frames groupby method
Once groupby object is create we can calculate various statistics for each group:
In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()
Note: If single brackets are used to specify the column (e.g. salary), then the output is Pandas Series object.
When double brackets are used the output is a Data Frame
18
Data Frames: method iloc
(summary)
df.iloc[0] # First row of a data frame
df.iloc[i] #(i+1)th row
df.iloc[-1] # Last row
19
Data Frames: Sorting
We can sort the data by a value in the column. By default the sorting will occur in
ascending order and a new data frame is return.
In [ ]: # Create a new data frame from the original sorted by the column Salary
df_sorted = df.sort_values( by ='service')
df_sorted.head()
Out[ ]:
20
Data Frames: Sorting
Out[ ]:
21
Missing Values
Missing values are marked as NaN
In [ ]: # Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")
Out[ ]:
22
Missing Values
There are a number of methods to deal with missing values in the data frame:
df.method() description
dropna() Drop missing observations
24
Aggregation Functions in Pandas
Aggregation - computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts
min, max
count, sum, prod
mean, median, mode, mad
std, var
25
Aggregation Functions in Pandas
agg() method are useful when multiple statistics are computed per column:
In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
Out[ ]:
26
Basic Descriptive Statistics
df.method() description
describe Basic statistics (count, mean, std, min, quantiles, max)
kurt kurtosis
27
Graphics to explore the data
Seaborn package is built on matplotlib but provides high level
interface for drawing attractive statistical graphics, similar to ggplot2
library in R. It specifically targets statistical data visualization
In [ ]: %matplotlib inline
28
Graphics
description
distplot histogram
barplot estimate of central tendency for a numeric variable
violinplot similar to boxplot, also shows the probability density of the
data
jointplot Scatterplot
regplot Regression plot
pairplot Pairplot
boxplot boxplot
swarmplot categorical scatterplot
factorplot General categorical plot
29
Basic statistical Analysis
statsmodel and scikit-learn - both have a number of function for statistical analysis
The first one is mostly used for regular analysis using R style formulas, while scikit-learn is
more tailored for Machine Learning.
statsmodels:
• linear regressions
• ANOVA tests
• hypothesis testings
• many more ...
scikit-learn:
• kmeans
• support vector machines
• random forests
• many more ...
“I can store numbers and other objects in a Python list and do all sorts
of computations and manipulations through list comprehensions, for-
loops etc. What do I need a NumPy array for?”
In Ou
: t:
Differences between lists and
ndarrays
• The key difference between an array and a list is that arrays are
designed to handle vectorised operations while a python lists are not.
• That means, if you apply a function, it is performed on every item in
the array, rather than on the whole array object.
• Let’s suppose you want to add the number 2 to every item in the list.
The intuitive way to do this is something like this:
In Ou
: t:
• That was not possible with a list, but you can do that on an array:
In Ou
: t:
• It should be noted here that, once a Numpy array is created, you
cannot increase its size.
• To do so, you will have to create a new array.
Create a 2d array from a list of list
• You can pass a list of lists to create a matrix-like a 2d array.
In
Ou
:
t:
The dtype argument
• You can specify the data-type by setting the dtype() argument.
• Some of the most commonly used NumPy dtypes are: float, int,
bool, str, and object.
In
Ou
:
t:
The astype argument
• You can also convert it to a different data-type using the astype method.
In Ou
: t:
• Remember that, unlike lists, all items in an array have to be of the same
type.
dtype=‘object’
• However, if you are uncertain about what data type your array will
hold, or if you want to hold characters and numbers in the same array,
you can set the dtype as 'object'.
In Ou
: t:
The tolist() function
• You can always convert an array into a list using the tolist()
command.
In Ou
: t:
Inspecting a NumPy array
• There are a range of functions built into NumPy that allow you to
inspect different aspects of an array:
In
: Ou
t:
Extracting specific items from an
array
• You can extract portions of the array using indices, much like when
you’re working with lists.
• Unlike lists, however, arrays can optionally accept as many parameters
in the square brackets as there are number of dimensions
In Ou
: t:
Boolean indexing
• A boolean index array is of the same shape as the array-to-be-filtered,
but it only contains TRUE and FALSE values.
In Ou
: t:
Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for
data analysis.
• It is a high-level abstraction over low-level NumPy, which is written in
pure C.
• Pandas provides high-performance, easy-to-use data structures and
data analysis tools.
• There are two main structures used by pandas; data frames and
series.
Indices in a pandas series
• A pandas series is similar to a list, but differs in the fact that a series associates a label
with each element. This makes it look like a dictionary.
• If an index is not explicitly provided by the user, pandas creates a RangeIndex ranging
from 0 to N-1.
• Each series object also has a data type.
In O
: ut
:
• As you may suspect by this point, a series has ways to extract all of
the values in the series, as well as individual elements by index.
In O
: ut
:
Ou
In t:
:
Filtering and maths operations
• Filtering and maths operations are easy with Pandas as well.
In O
: ut
:
Pandas data frame
• Simplistically, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Rows consist of elements inside series.
Ou
t:
• You can also create a data frame from a list.
In Ou
: t:
• You can ascertain the type of a column with the type() function.
In
:
Ou
t:
• A Pandas data frame object as two indices; a column index and row
index.
• Again, if you do not provide one, Pandas will create a RangeIndex
from 0 to N-1.
In
:
Ou
t:
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame:
In Ou
: t:
• or do it during runtime.
• Here, I also named the index ‘country code’.
Ou
In t:
:
• Row access using index can be performed in several ways.
• First, you could use .loc() and provide an index label.
In Ou
: t:
In Ou
: t:
• A selection of particular rows and columns can be selected this way.
In Ou
: t:
• You can feed .loc() two arguments, index list and column list, slicing
operation is supported as well:
In Ou
: t:
Filtering
• Filtering is performed using so-called Boolean arrays.
Deleting columns
• You can delete a column using the drop() function.
In Ou
: t:
In Ou
: t:
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the most.
• You can read in the data from a CSV file using the read_csv() function.
• Similarly, you can write a data frame to a csv file with the to_csv()
function.
• Pandas has the capacity to do much more than what we have
covered here, such as grouping data and even data visualisation.
• However, as with NumPy, we don’t have enough time to cover every
aspect of pandas here.
Exploratory data analysis (EDA)
Exploring your data is a crucial step in data analysis. It involves:
• Organising the data set
• Plotting aspects of the data set
• Maybe producing some numerical summaries; central tendency and
spread, etc.
“Exploratory data analysis can never be the whole story, but nothing
else can serve as the foundation stone.”
- John Tukey.
Download the data
• Download the Pokemon dataset from:
https://github.com/LewBrace/da_and_vis_python
• Unzip the folder, and save the data file in a location you’ll remember.
Reading in the data
• First we import the Python packages we are going to use.
• Then we use Pandas to load in the dataset as a data frame.
NOTE: The argument index_col argument states that we'll treat the first
column of the dataset as the ID column.
NOTE: The encoding argument allows us to by pass an input error created
by special characters in the data set.
Examine the data set
• We could spend time staring at these
numbers, but that is unlikely to offer
us any form of insight.
• We could begin by conducting all of
our statistical tests.
• However, a good field commander
never goes into battle without first
doing a recognisance of the terrain…
• This is exactly what EDA is for…
Plotting a histogram in Python
Bins
• You may have noticed the two histograms we’ve seen so far look different,
despite using the exact same data.
• This is because they have different bin values.
• The left graph used the default bins generated by plt.hist(), while the one on
the right used bins that I specified.
• There are a couple of ways to manipulate bins in matplotlib.
• Here, I specified where the edges of the bars of the histogram are; the
bin edges.
• You could also specify the number of bins, and Matplotlib will
automatically generate a number of evenly spaced bins.
Seaborn
• Matplotlib is a powerful, but sometimes unwieldy, Python library.
• Seaborn provides a high-level interface to Matplotlib and makes it
easier to produce graphs like the one on the right.
• Some IDEs incorporate elements of this “under the hood” nowadays.
Benefits of Seaborn
• Seaborn offers:
- Using default themes that are aesthetically pleasing.
- Setting custom colour palettes.
- Making attractive statistical plots.
- Easily and flexibly displaying distributions.
- Visualising information from matrices and DataFrames.
• The last three points have led to Seaborn becoming the exploratory
data analysis tool of choice for many Python users.
Plotting with Seaborn
• One of Seaborn's greatest strengths is its diversity of plotting
functions.
• Most plots can be created with one line of code.
• For example….
Histograms
• Allow you to plot the distributions of numeric variables.
Other types of graphs: Creating a
scatter plot
Name of our
Name of variable we dataframe fed to the
want on the x-axis “data=“ command
Colour by stage.
Separate by stage.
Generate using a swarmplot.
Rotate axis on x-ticks by 45 degrees.
A box plot
• The total, stage, and legendary entries are not combat stats so we should
remove them.
• Pandas makes this easy to do, we just create a new dataframe
• We just use Pandas’ .drop() function to create a dataframe that doesn’t
include the variables we don’t want.
Seaborn’s theme
• Seaborn has a number of themes you can use to alter the appearance
of plots.
• For example, we can use “whitegrid” to add grid lines to our boxplot.
Violin plots
• Violin plots are useful alternatives to box plots.
• They show the distribution of a variable through the thickness of the violin.
• Here, we visualise the distribution of attack by Pokémon's primary type:
• Dragon types tend to have higher Attack stats than Ghost types, but they also have greater
variance. But there is something not right here….
• The colours!
Seaborn’s colour palettes
• Seaborn allows us to easily set custom colour palettes by providing it
with an ordered list of colour hex values.
• We first create our colours list.
• Then we just use the palette= function and feed in our colours list.
• Because of the limited number of observations, we could also use a
swarm plot.
• Here, each data point is an observation, but data points are grouped
together by the variable listed on the x-axis.
Overlapping plots
• Both of these show similar information, so it might be useful to
overlap them.
Set size of print canvas.
• All 6 of the stat columns have been "melted" into one, and
the new Stat column indicates the original stat (HP, Attack,
Defense, Sp. Attack, Sp. Defense, or Speed).
• It's hard to see here, but each pokemon now has 6 rows of
data; hende the melted_df has 6 times more rows of data.
• This graph could be made to look nicer with a few tweaks.