DSP U2

This document outlines the course structure for 'Data Science using Python' at RMK Group of Educational Institutions, including objectives, prerequisites, syllabus, and course outcomes. It covers topics such as Python libraries, classification, clustering, and data visualization, along with a detailed lecture plan and assessment schedule. The document emphasizes confidentiality and is intended solely for educational purposes.

Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of RMK Group of Educational Institutions. If you have received this document through email in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group / learning community. If you are not the addressee, you should not disseminate, distribute or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete it from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
22AI302 DATA SCIENCE USING PYTHON

Department: AI & DS
Batch/Year: 2022-2026 / II Year
Created by: Ms. Divya D M / Asst. Professor
Date: 27.07.2023
1. Table of Contents

1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
6. CO-PO/PSO Mapping
7. Lecture Plan
8. Activity Based Learning
9. Lecture Notes
10. Assignments
11. Part A Q & A
12. Part B Qs
13. Supportive Online Certification Courses
14. Real Time Applications in Day-to-Day Life and to Industry
15. Contents Beyond the Syllabus
16. Assessment Schedule
17. Prescribed Text Books and Reference Books
18. Mini Project Suggestions


2.Course Objectives

➢ To explain the fundamentals of data science


➢ To experiment and implement python libraries for data science
➢ To apply and implement basic classification algorithms
➢ To apply clustering and outlier detection approaches.
➢ To present and interpret data using visualization tools in Python
3. Pre-Requisites

Semester-III: Data Science using Python
Semester-II: Python Programming
Semester-I: C Programming
4. SYLLABUS

22AI302 DATA SCIENCE USING PYTHON    L T P C: 2 0 2 3

UNIT I INTRODUCTION 6+6

Data Science: Benefits and uses – facets of data - Data Science Process: Overview –
Defining research goals – Retrieving data – data preparation - Exploratory Data
analysis – build the model – presenting findings and building applications - Data
Mining - Data Warehousing – Basic statistical descriptions of Data.

List of Exercise/Experiments:

1. Download, install and explore the features of R/Python for data analytics
• Installing Anaconda
• Basic Operations in Jupyter Notebook
• Basic Data Handling

UNIT II PYTHON LIBRARIES FOR DATA SCIENCE 6+6


Introduction to Numpy - Multidimensional Ndarrays – Indexing – Properties –
Constants – Data Visualization: Ndarray Creation – Matplotlib - Introduction to
Pandas – Series – Dataframes – Visualizing the Data in Dataframes - Pandas Objects
– Data Indexing and Selection – Handling missing data – Hierarchical indexing –
Combining datasets – Aggregation and Grouping – Joins- Pivot Tables - String
operations – Working with time series – High performance Pandas.

List of Exercise/Experiments:

1. Working with Numpy arrays - Creation of numpy array using the tuple, Determine
the size, shape and dimension of the array, Manipulation with array Attributes,
Creation of Sub array, Perform the reshaping of the array along the row vector and
column vector, Create Two arrays and perform the concatenation among the arrays.

2. Working with Pandas data frames - Series, DataFrame, and Index; implement the data selection operations; data indexing operations like loc, iloc, and ix; handling missing data like None and NaN; manipulate null values using isnull(), notnull(), dropna(), and fillna().
3. Perform statistics operations on the data (sum, product, median, minimum and maximum, quantiles, argmin, argmax, etc.).
4. Use any data set to compute the mean, standard deviation, and percentiles.

UNIT III CLASSIFICATION 6+6


Basic Concepts – Decision Tree Induction – Bayes Classification Methods –
Rule-Based Classification – Model Evaluation and Selection
Bayesian Belief Networks – Classification by Back propagation – Support Vector
Machines – Associative Classification – K-Nearest-Neighbor Classifiers – Fuzzy Set
Approaches - Multiclass Classification - Semi-Supervised Classification.

List of Exercise/Experiments:
1. Apply Decision Tree algorithms on any data set.
2. Apply SVM on any data set
3. Implement K-Nearest-Neighbor Classifiers

UNIT IV CLUSTERING AND OUTLIER DETECTION 6+6

Cluster Analysis – Partitioning Methods – Evaluation of Clusters – Probabilistic


Model-Based Clustering – Outliers and Outlier Analysis – Outlier Detection Methods –
Statistical Approaches – Clustering and Classification-Based Approaches.

List of Exercise/Experiments:
1. Apply K-means algorithms for any data set.
2. Perform Outlier Analysis on any data set.

UNIT V DATA VISUALIZATION 6+6


Importing Matplotlib – Simple line plots – Simple scatter plots – visualizing errors –
density and contour plots – Histograms – legends – colors – subplots – text and
annotation – customization - three dimensional plotting - Geographic Data with
Basemap - Visualization with Seaborn.

List of Exercise/Experiments:
1. Basic plots using Matplotlib.
2. Implementation of Scatter Plot.
3. Construction of Histogram, bar plot, Subplots, Line Plots.
4. Implement three-dimensional plotting.


5. Visualize a dataset with Seaborn.
TOTAL:30+30=60 PERIODS

TEXTBOOKS:

1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data


Science”, Manning Publications, 2016.
2. Ashwin Pajankar, Aditya Joshi, Hands-on Machine Learning with Python:
Implement Neural Network Solutions with Scikit-learn and PyTorch, Apress,
2022.
3. Jake VanderPlas, “Python Data Science Handbook – Essential tools for
working with data”, O’Reilly, 2017.

REFERENCES:
1. Roger D. Peng, R Programming for Data Science, Lulu.com, 2016.
2. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann, 2012.
3. Samir Madhavan, Mastering Python for Data Science, Packt Publishing, 2015.
4. Laura Igual, Santi Seguí, "Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications", 1st Edition, Springer, 2017.
5. Peter Bruce, Andrew Bruce, "Practical Statistics for Data Scientists: 50 Essential Concepts", 3rd Edition, O'Reilly, 2017.
6. Hector Guerrero, "Excel Data Analysis: Modelling and Simulation", Springer International Publishing, 2nd Edition, 2019.

NPTEL Courses:
a. Data Science for Engineers - https://onlinecourses.nptel.ac.in/noc23_cs17/preview
b. Python for Data Science - https://onlinecourses.nptel.ac.in/noc23_cs21/preview

LIST OF EQUIPMENTS:
Systems with Anaconda, Jupyter Notebook, Python, Pandas, NumPy, MatPlotlib
5.COURSE OUTCOMES

At the end of this course, the students will be able to:

COURSE OUTCOMES HKL

CO1 Explain the fundamentals of data science. K2

CO2 Experiment with Python libraries for data science. K3

CO3 Apply and implement basic classification algorithms. K3

CO4 Implement clustering and outlier detection approaches. K4

CO5 Present and interpret data using visualization tools in Python. K3


6. CO – PO/PSO Mapping Matrix

CO    PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO1   3    2    2    1    1    1    1    1    2    3     3
CO2   3    3    3    3    1    1    1    1    2    3     3
CO3   3    3    3    3    3    3    3    3    2    3     3
CO4   3    3    3    3    3    3    3    3    2    3     3
CO5   3    3    3    3    3    3    3    3    2    3     3
Lecture Plan
Unit - II
LECTURE PLAN – Unit 2 – PYTHON LIBRARIES FOR DATA SCIENCE

Each topic below is allotted 1 period, maps to CO2 at taxonomy level K2, and is delivered via PPT / Chalk & Talk.

1. Introduction to Numpy - Multidimensional Ndarrays (proposed date: 26.08.2023)
2. Indexing - Properties - Constants (proposed date: 28.08.2023)
3. Data Visualization: Ndarray Creation - Matplotlib (proposed date: 29.08.2023)
4. Introduction to Pandas - Series (proposed date: 30.08.2023)
5. Dataframes - Visualizing the Data in Dataframes (proposed date: 31.08.2023)
6. Pandas Objects - Data Indexing and Selection (proposed date: 01.09.2023)
7. Handling missing data - Hierarchical indexing (proposed date: 02.09.2023)
8. Combining datasets - Aggregation and Grouping (proposed date: 04.09.2023)
9. Joins - Pivot Tables - String operations (proposed date: 05.09.2023)
10. Working with time series - High performance Pandas (proposed date: 07.09.2023)
8. ACTIVITY BASED LEARNING

Activity name:

Creating and Automating an Interactive Dashboard using Python

Students will gain a better understanding of how Python libraries and other Python features work with any dataset.

The steps involved in this activity are listed below (a minimal sketch of steps 2 to 4 follows the list):

1. Downloading daily updated data from the web using selenium

2. Updating data directories using shutil, glob, and os python libraries

3. Simple cleaning of excel files with pandas

4. Formatting time series data frames to be input into plotly graphs

5. Creating a local web page for your dashboard using dash
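
A minimal, hedged sketch of steps 2 to 4 is shown below. The folder layout and the 'date'/'cases' column names are hypothetical, and the selenium download (step 1) and dash page (step 5) are omitted for brevity:

import glob, os, shutil
import pandas as pd
import plotly.express as px

# Step 2: pick the newest downloaded CSV and copy it into the data directory.
latest = max(glob.glob('downloads/*.csv'), key=os.path.getmtime)
shutil.copy(latest, 'data/current.csv')

# Step 3: simple cleaning with pandas (hypothetical 'date' and 'cases' columns).
df = pd.read_csv('data/current.csv')
df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['cases']).sort_values('date')

# Step 4: format the time series and feed it into a plotly graph.
fig = px.line(df, x='date', y='cases', title='Daily cases')
fig.show()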

Guidelines to do the activity:

1) Students can form groups (3 students per team).

2) Take any dataset.

3) Import the required Python libraries (follow the steps mentioned above).

4) Conduct a peer review (each team will be reviewed by all other teams and mentors).

Useful links:

https://towardsdatascience.com/creating-and-automating-an-interactive-dashboard-using-python-5d9dfa170206

https://github.com/tsbloxsom/Texas-census-county-data-project
UNIT-II
PYTHON LIBRARIES FOR DATA SCIENCE
9.LECTURE NOTES
1. Introduction to Numpy
• NumPy is the fundamental library for numerical computation. It is an
integral part of the Scientific Python Ecosystem.
• NumPy is important because it is used to store data. It has a basic yet
very versatile data structure known as the Ndarray, which stands for
N-Dimensional Array. Python has many array-like data structures (e.g., the
list), but the Ndarray is the most versatile and most preferred structure for
storing scientific and numerical data.
• Many libraries have their own data structures, and most of them use Ndarrays
as their base. Ndarrays are compatible with many data structures and
routines, just like lists. Let us create a simple Ndarray as follows:
import numpy as np
lst1 = [1, 2, 3]
arr1 = np.array(lst1)

• Here, we are importing NumPy under the alias np. Then we create a list and
pass it as an argument to the function array().
• Let’s see the data types of all the variables used:
print(type(lst1))

print(type(arr1))

The output is as follows:

<class 'list'>
<class 'numpy.ndarray'>
Let’s see the contents of the Ndarray as follows:
arr1
The output is as follows:
array([1, 2, 3])
We can write it in a single line as follows:

arr1 = np.array([1, 2, 3])

We can specify the data type of the members of the Ndarray as follows:

arr1 = np.array([1, 2, 3], dtype=np.uint8)

2. Multidimensional Ndarrays

We can create multidimensional arrays as follows:

arr1 = np.array([[1, 2, 3], [4, 5, 6]], np.int16)

arr1

The output is as follows:

array([[1, 2, 3],

[4, 5, 6]], dtype=int16)

This is a two-dimensional array. We can also create a multidimensional (3D array in


the following case) array as follows:

arr1 = np.array([[[1, 2, 3], [4, 5, 6]],

[[7, 8, 9], [0, 0, 0]],

[[-1, -1, -1], [1, 1, 1]]], np.int16)

arr1

The output is as follows:

array([[[ 1, 2, 3],

[ 4, 5, 6]],

[[ 7, 8, 9],

[ 0, 0, 0]],

[[-1, -1, -1],

[ 1, 1, 1]]], dtype=int16)
3. Indexing of Ndarrays

• We can address the elements (also called as the members) of the Ndarrays
individually. Let’s see how to do it with one-dimensional Ndarrays:

arr1 = np.array([1, 2, 3], dtype=np.uint8)

• We can address its elements as follows:

print(arr1[0])

print(arr1[1])

print(arr1[2])

• Just like lists, it follows C style indexing where the first element is at the
position of 0 and the nth element is at the position (n-1).
• We can also see the last element with negative location number as follows:

print(arr1[-1])

• The last but one element can be seen as follows:

print(arr1[-2])

• If we use an invalid index as follows:

print(arr1[3])

it throws the following error:

--------------------------------------------------------------------------

IndexError Traceback (most recent call last)


<ipython-input-24-20c8f9112e0b> in <module>

----> 1 print(arr1[3])

IndexError: index 3 is out of bounds for axis 0 with size 3


• Let’s create a 2D Ndarray as follows:

arr1 = np.array([[1, 2, 3], [4, 5, 6]], np.int16)

• We can also address elements of a 2D Ndarray:

print(arr1[0, 0]);

print(arr1[0, 1]);

print(arr1[0, 2]);

The output is as follows:

1
2
3

• We can access entire rows as follows:

print(arr1[0, :])

print(arr1[1, :])

• We can also access entire columns as follows:

print(arr1[:, 0])

print(arr1[:, 1])

• We can also extract the elements of a three-dimensional array as follows:

arr1 = np.array([[[1, 2, 3], [4, 5, 6]],

[[7, 8, 9], [0, 0, 0]],

[[-1, -1, -1], [1, 1, 1]]], np.int16)

• Let’s address the elements of the 3D array as follows:

print(arr1[0, 0, 0])

print(arr1[1, 1, 2])

print(arr1[:, 1, 1])

We can access elements of Ndarrays this way
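
As a quick consolidation, here is a small, hedged sketch (the array values are chosen only for illustration) that combines indexing with slicing on a 2D Ndarray:

import numpy as np

arr = np.array([[10, 20, 30],
                [40, 50, 60]], np.int16)

print(arr[1, 2])     # a single element -> 60
print(arr[-1, -1])   # negative indices count from the end -> 60
print(arr[0, 1:])    # a slice of the first row -> [20 30]
print(arr[:, :2])    # the first two columns of every row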


4. Ndarray Properties

We can learn more about the Ndarrays by referring to their properties.

Let us learn all the properties with a demonstration, using the following 3D array:

x2 = np.array([[[1, 2, 3], [4, 5, 6]],[[0, -1, -2], [-3, -4, -5]]], np.int16)

We can know the number of dimensions with the following statement:

print(x2.ndim)

The output is the number of dimensions, which is 3 here.

We can know the shape of the Ndarray as follows:

print(x2.shape)

The shape means the size of the dimensions as follows:

(2, 2, 3)

We can know the data type of the members as follows:

print(x2.dtype)

The output is as follows:

int16

We can know the size (number of elements) and the number of bytes required in the
memory for the storage as follows:

print(x2.size)

print(x2.nbytes)

The output is as follows:

12

24

We can compute the transpose with the following code:

print(x2.T)

The transpose reverses the order of the axes, so the shape (2, 2, 3) becomes (3, 2, 2).
5. NumPy Constants

NumPy library has many useful mathematical and scientific constants we can use in
programs. The following code snippet prints all such important constants:

print(np.inf)

print(np.NAN)

print(np.NINF)

print(np.NZERO)

print(np.PZERO)

print(np.e)

print(np.euler_gamma)

print(np.pi)

The output is as follows:

inf

nan

-inf

-0.0

0.0

2.718281828459045

0.5772156649015329

3.141592653589793
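
As a brief, hedged illustration (standard NumPy behaviour, not part of the original notes), these constants can be used and tested in ordinary expressions:

import numpy as np

print(np.nan == np.nan)      # False: NaN is not equal to anything, including itself
print(np.isnan(np.nan))      # True: use np.isnan() to test for NaN
print(np.inf > 1e308)        # True: infinity is larger than any finite float
print(np.isinf(np.NINF))     # True: np.NINF is negative infinity
print(np.e ** 2, 2 * np.pi)  # e and pi are ordinary floats usable in arithmetic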
6. Data Visualization: Numpy routines for Ndarray Creation

The routine np.empty() creates an array of the given size without initializing it; its elements are whatever values happen to be in memory, so they appear random.

import numpy as np

x = np.empty([3, 3], np.uint8)

print(x)

It will output an array of arbitrary numbers, and the output may be different in your case because the array is uninitialized. We can create multidimensional matrices as follows:

x = np.empty([3, 3, 3], np.uint8)

print(x)

We can use the routine np.eye() to create a matrix of all zeros except the diagonal, which contains all ones.

y = np.eye(4, dtype=np.uint8)

print(y)

The output is as follows:

[[1 0 0 0]

[0 1 0 0]

[0 0 1 0]

[0 0 0 1]]

We can also set the position of the diagonal as follows:

y = np.eye(4, dtype=np.uint8, k=1)

print(y)
The output is as follows:

[[0 1 0 0]

[0 0 1 0]

[0 0 0 1]

[0 0 0 0]]

We can even have the negative value for the position of the diagonal with all ones as
follows:

y = np.eye(4, dtype=np.uint8, k=-1)

print(y)

Run it and see the output; with k=-1 the ones appear just below the main diagonal:

[[0 0 0 0]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]]

The function np.identity() returns an identity matrix of the specified size. An identity
matrix is a matrix where all the elements at the diagonal are 1 and the rest of the
elements are 0. The following are a few examples of that:

x = np.identity(3, dtype= np.uint8)

print(x)

x = np.identity(4, dtype= np.uint8)

print(x)

The routine np.ones() returns the matrix of the given size that has all the elements
as ones. Run the following examples to see it in action:

x = np.ones((3, 3, 3), dtype=np.int16)

print(x)

x = np.ones((1, 1, 1), dtype=np.int16)

print(x)
Let us have a look at the routine arange(). It creates an Ndarray of evenly spaced values with the given step. The stop value is a compulsory argument; the start value and step parameters default to 0 and 1, respectively.
Let us see an example:

np.arange(10)

The output is as follows:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The routine linspace() returns an Ndarray of evenly spaced numbers over a specified interval. We must pass it the starting value, the end value, and the number of values as follows:

np.linspace(0, 20, 30)

The output is an array of 30 evenly spaced values from 0 to 20, with both endpoints included.

Similarly, we can create Ndarrays with logarithmic spacing as follows:

np.logspace(0.1, 2, 10)

The output is an array of 10 values from 10**0.1 (about 1.26) to 10**2 (100), evenly spaced on a logarithmic scale.

We can also create Ndarrays with geometric spacing:

np.geomspace(0.1, 20, 10)

The output is an array of 10 values from 0.1 to 20 with a constant ratio between consecutive elements.
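
To make the spacing rules concrete, here is a small, hedged sketch (standard NumPy only) that checks the behaviour of these routines:

import numpy as np

lin = np.linspace(0, 20, 30)     # 30 evenly spaced values, endpoints included
log = np.logspace(0.1, 2, 10)    # 10 values from 10**0.1 to 10**2
geo = np.geomspace(0.1, 20, 10)  # 10 values from 0.1 to 20, constant ratio

print(lin.size, lin[0], lin[-1])         # 30 0.0 20.0
print(np.diff(lin)[0])                   # constant step of 20/29
print(log[0], log[-1])                   # about 1.2589 and exactly 100.0
print(geo[1] / geo[0], geo[2] / geo[1])  # the ratio between neighbours is constant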


6.1 Matplotlib Data Visualization

• Matplotlib is a data visualization library. It is an integral part of the Scientific


Python Ecosystem. Many other data visualization libraries are just wrappers
on Matplotlib.
• Matplotlib’s Pyplot module provides a MATLAB-like interface. Let’s begin by
writing the demonstration programs.

The following command is known as a magic command; it enables Jupyter Notebook
to show Matplotlib visualizations inline:

%matplotlib inline

We can import the Pyplot module of Matplotlib as follows:

import matplotlib.pyplot as plt

We can draw a simple linear plot as follows:

x = np.arange(10)

y=x+1

plt.plot(x, y)

plt.show()

The output is shown in Figure 3-1.


We can have a multiline plot as follows:

x = np.arange(10)

y1 = 1 - x

plt.plot(x, y, x, y1)

plt.show()

The output is shown in Figure 3-2.

As we can see, the routine plt.plot() can visualize data as simple lines. We can also
plot data of other forms with it. The limitation is that it must be single dimensional.
Let’s draw a sine wave as follows:

n=3

t = np.arange(0, np.pi*2, 0.05)

y = np.sin( n * t )

plt.plot(t, y)

plt.show()
The output is shown in Figure 3-3.

We can also have other types of plots. Let’s visualize a bar plot.
n=5

x = np.arange(n)

y = np.random.rand(n)

plt.bar(x, y)

plt.show()

The output is as shown in Figure 3-4.


We can rewrite the same code in an object-oriented way as follows:

fig, ax = plt.subplots()

ax.bar(x, y)

ax.set_title('Bar Graph')

ax.set_xlabel('X')

ax.set_ylabel('Y')

plt.show()

As we can see, the code creates a figure and an axis that we can use to call
visualization routines and to set the properties of the visualizations.

Let’s see how to create subplots. Subplots are the plots within the visualization. We
can create them as follows:

x = np.arange(10)

plt.subplot(2, 2, 1)

plt.plot(x, x)

plt.title('Linear')

plt.subplot(2, 2, 2)

plt.plot(x, x*x)

plt.title('Quadratic')

plt.subplot(2, 2, 3)

plt.plot(x, np.sqrt(x))

plt.title('Square root')

plt.subplot(2, 2, 4)

plt.plot(x, np.log(x))
plt.title('Log')

plt.tight_layout()

plt.show()

As we can see, we are creating a subplot before each plotting routine call. The
routine tight_layout() creates enough spacing between subplots. The output is as
shown in Figure 3-5.

We can write the same code in the object-oriented fashion as follows:

fig, ax = plt.subplots(2, 2)

ax[0][0].plot(x, x)

ax[0][0].set_title('Linear')

ax[0][1].plot(x, x*x)

ax[0][1].set_title('Quadratic')

ax[1][0].plot(x, np.sqrt(x))
ax[1][0].set_title('Square Root')

ax[1][1].plot(x, np.log(x))

ax[1][1].set_title('Log')

plt.subplots_adjust(left=0.1,

bottom=0.1,

right=0.9,

top=0.9,

wspace=0.4,

hspace=0.4)

plt.show()

Let’s move ahead with the scatter plot. We can visualize 2D data as scatter plot as
follows:

n = 100

x = np.random.rand(n)

y = np.random.rand(n)

plt.scatter(x, y)

plt.show()

The output is as shown in Figure 3-6.


A histogram is a graphical depiction of the frequency distribution of data. We can easily create histograms with Matplotlib as follows:

mu, sigma = 0, 0.1

x = np.random.normal(mu, sigma, 1000)

plt.hist(x)

plt.show()

Here, mu means mean, and sigma means standard deviation. The output is as
shown in Figure 3-7.

Let’s conclude with a pie chart.

x = np.array([10, 20, 30, 40])

plt.pie(x)

plt.show()

The output is as shown in Figure 3-8.


7. Introduction to Pandas

• Pandas is the data analytics and data science library of the Scientific Python
Ecosystem. Just like NumPy, Matplotlib, IPython, and Jupyter Notebook, it is
an integral part of the ecosystem.
• It is used for storage, manipulation, and visualization of multidimensional
data. Its data structures are more flexible than Ndarrays and also compatible
with them, which means we can use Ndarrays to create Pandas data structures.
• Let’s create a new notebook for the demonstrations in this chapter. We can
install Pandas with the following command in the Jupyter Notebook session:

!pip3 install pandas

The following code imports the library to the current program or Jupyter Notebook
session:

import pandas as pd

7.1 Series in Pandas

• A Pandas series is a homogeneous one-dimensional array with an index.


• It can store the data of any supported type. We can use lists or Ndarrays to
create series in Pandas. Let’s create a new notebook for demonstrations in the
chapter. Let’s import all the needed libraries:

%matplotlib inline

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

Let’s create a simple series using list as follows:

s1 = pd.Series([1, 2, 3 , 4, 5])
If we type the following code:

type(s1)

we get the following output:

pandas.core.series.Series

We can also create a series with the following code:

s2 = pd.Series(np.arange(5), dtype=np.uint8)

s2

The output is as follows:

0 0

1 1

2 2

3 3

4 4

dtype: uint8

• The first column is the index, and the second column is the data column. We
can create a series by using an already defined Ndarray as follows,

arr1 = np.arange(5, dtype=np.uint8)

s3 = pd.Series(arr1, dtype=np.int16)

s3

In this case, the dtype passed to pd.Series() (int16) overrides the Ndarray's original dtype (uint8) and becomes the final data type of the series.
7.2 Properties of Series

We can check the properties of series as follows.

We can check the values of the members of the series as follows:

s3.values

The output is as follows:

array([0, 1, 2, 3, 4], dtype=int16)

We can also check the values of the series with the following code:

s3.array

The output is as follows:

<PandasArray>

[0, 1, 2, 3, 4]

Length: 5, dtype: int16

We can check the index of the series:

s3.index

The following is the output:

RangeIndex(start=0, stop=5, step=1)

We can check the datatype as follows:

s3.dtype

We can check the shape as follows:

s3.shape

We can check the size as follows:

s3.size
We can check the number of bytes as follows:

s3.nbytes

And we can check the dimensions as follows:

s3.ndim

7.3 Pandas Dataframes

• Pandas provides a built-in, two-dimensional, indexed data structure known as the DataFrame.
• We can create dataframes from series, Ndarrays, lists, and dictionaries. If you
have ever worked with relational databases, then you can consider
dataframes analogous to tables in the databases.
• Let’s see how to create a dataframe. Let’s create a dictionary of population
data for cities as follows:

data = {'city': ['Bangalore', 'Bangalore', 'Bangalore',

'Mumbai', 'Mumbai', 'Mumbai'],

'year': [2020, 2021, 2022, 2020, 2021, 2022,],

'population': [10.0, 10.1, 10.2, 5.2, 5.3, 5.5]}

We can create a dataframe using this dictionary:

df1 = pd.DataFrame(data)

print(df1)

The output is as follows:

        city  year  population
0  Bangalore  2020        10.0
1  Bangalore  2021        10.1
2  Bangalore  2022        10.2
3     Mumbai  2020         5.2
4     Mumbai  2021         5.3
5     Mumbai  2022         5.5

We can see the first five records of the dataframe directly with the following code:

df1.head()

Run this and see the output. We can also create the dataframe with a specific order
of columns as follows:

df2 = pd.DataFrame(data, columns=['year', 'city', 'population'])


print(df2)

The output is as follows:

   year       city  population
0  2020  Bangalore        10.0
1  2021  Bangalore        10.1
2  2022  Bangalore        10.2
3  2020     Mumbai         5.2
4  2021     Mumbai         5.3
5  2022     Mumbai         5.5

As we can see, the order of columns is different this time.

8. Visualizing the Data in Dataframes

• We have learned the data visualization of NumPy data with the data
visualization library Matplotlib.
• Now, we will learn how to visualize Pandas data structures.
• Objects of Pandas data structures call Matplotlib visualization functions like
plot(). Basically, Pandas provides a wrapper for all these functions. Let us see
a simple example as follows:

df1 = pd.DataFrame()

df1['A'] = pd.Series(list(range(100)))

df1['B'] = np.random.randn(100, 1)

df1
So this code creates a dataframe. Let’s plot it now:

df1.plot(x='A', y='B')

plt.show()

The output is as shown in Figure 4-1.

• Now let’s explore the other plotting methods. We will create a dataset of four
columns.
• The columns will have random data generated with NumPy. So your output
will be definitely different.
• We will use the generated dataset for the rest of the examples. So let’s
generate the dataset:

df2 = pd.DataFrame(np.random.rand(10, 4),

columns=['A', 'B', 'C', 'D'])

print(df2)
It generates a 10 x 4 table of random values (your numbers will differ, since the data is random).

Let us plot bar graphs as follows:

df2.plot.bar()

plt.show()

The output is as shown in Figure 4-2.


We can plot these graphs horizontally too as follows:

df2.plot.barh()

plt.show()

The output is as shown in Figure 4-3.

These bar graphs were unstacked. We can stack them up as follows:

df2.plot.bar(stacked = True)

plt.show()

The output is as shown in Figure 4-4.


We can have horizontal stacked bars as follows:

df2.plot.barh(stacked = True)

plt.show()

The output is as shown in Figure 4-5.

Histograms are a visual representation of the frequency distribution of data. We can


plot a simple histogram as follows:

df2.plot.hist(alpha=0.7)

plt.show()

The output is as shown in Figure 4-6.


We can have a stacked histogram as follows:

df2.plot.hist(stacked=True, alpha=0.7)

plt.show()

The output is as shown in Figure 4-7.

We can also customize buckets (also known as bins) as follows:

df2.plot.hist(stacked=True, alpha=0.7, bins=20)

plt.show()

The output is as shown in Figure 4-8.


We can also draw box plots as follows:

df2.plot.box()

plt.show()

The output is as shown in Figure 4-9.

We can draw an area plot as follows:

df2.plot.area()

plt.show()

The output is as shown in Figure 4-10.


We can draw an unstacked area plot as follows:

df2.plot.area(stacked=False)

plt.show()

The output is as shown in Figure 4-11.

9. Pandas Objects

• Pandas objects can be thought of as enhanced versions of NumPy structured


arrays in which the rows and columns are identified with labels rather than
simple integer indices.
• Pandas provides a host of useful tools, methods, and functionality on top of
the basic data structures.

9.1 The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a


list or array as follows:

data = pd.Series([0.25, 0.5, 0.75, 1.0])

data
Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
In the output, the Series wraps both a sequence of values and a sequence of
indices, which we can access with the values and index attributes. The values are
simply a familiar NumPy array:
data.values
Output:
array([ 0.25, 0.5 , 0.75, 1. ])
The index is an array-like object of type pd.Index
data.index
Output:
RangeIndex(start=0, stop=4, step=1)
Like with a NumPy array, data can be accessed by the associated index via the
familiar Python square-bracket notation:
data[1]
Output:
0.5

data[1:3]
Output:
1 0.50
2 0.75
dtype: float64
Pandas Series is much more general and flexible than the one-dimensional NumPy
array.
9.2 Series as generalized NumPy array

While the Numpy Array has an implicitly defined integer index used to access the
values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For
example, the index need not be an integer, but can consist of values of any desired
type. For example, if we wish, we can use strings as an index:

data = pd.Series([0.25, 0.5, 0.75, 1.0],


index=['a', 'b', 'c', 'd'])
data
Output:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
The items can be accessed as:

data['b']
Output:
0.5
We can even use non-contiguous or non-sequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],


index=[2, 5, 3, 7])
data
Output:
2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
data[5]
Output:
0.5
9.3 Series as specialized dictionary

Because its values are stored in a typed, contiguous array, a Pandas Series can be much more efficient than a Python dictionary for certain operations, such as vectorized arithmetic over all values.

The Series-as-dictionary analogy can be made even more clear by constructing a


Series object directly from a Python dictionary:

population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127,


'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
population
Output:
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
By default, a Series will be created where the index is drawn from the sorted keys.
From here, typical dictionary-style item access can be performed:
population['California']
Output:
38332521
The Series also supports array-style operations such as slicing:

population['California':'Illinois']
Output:
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
9.4 Constructing Series objects

pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an
integer sequence:

pd.Series([2, 4, 6])

Output:

0 2

1 4

2 6

dtype: int64

data can be a scalar, which is repeated to fill the specified index:

pd.Series(5, index=[100, 200, 300])

Output:

100 5

200 5

300 5

dtype: int64

data can be a dictionary, in which index defaults to the sorted dictionary keys:

pd.Series({2:'a', 1:'b', 3:'c'})

Output:

1 b

2 a

3 c

dtype: object
In each case, the index can be explicitly set if a different result is preferred:

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

Output:

3 c

2 a

dtype: object

In this case, the Series is populated only with the explicitly identified keys.

10. The Pandas DataFrame Object

10.1 DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame


is an analog of a two-dimensional array with both flexible row indices and flexible
column names.

A Data Frame is a sequence of aligned Series objects. Here, by "aligned" we mean


that they share the same index.
To demonstrate this we construct a new Series listing the area of each of the five
states:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
As we have this along with the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:

states = pd.DataFrame({'population': population,

'area': area})

states

Output:

            population    area
California    38332521  423967
Florida       19552860  170312
Illinois      12882135  149995
New York      19651127  141297
Texas         26448193  695662
Like the Series object, the DataFrame has an index attribute that gives access to the
index labels:

states.index

Output:

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object


holding the column labels:

states.columns

Output:

Index(['area', 'population'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional


NumPy array, where both the rows and columns have a generalized index for
accessing the data.
10.2 DataFrame as specialized dictionary

We can think of a DataFrame as a specialization of a dictionary. Where a dictionary


maps a key to a value, a DataFrame maps a column name to a Series of column
data. For example, asking for the 'area' attribute returns the Series object containing
the areas.

states['area']
Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
10.3 Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways.

● From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be


constructed from a single Series:

pd.DataFrame(population, columns=['population'])

Output:
● From a list of dicts

Any list of dictionaries can be made into a DataFrame.

data = [{'a': i, 'b': 2 * i} for i in range(3)]


pd.DataFrame(data)
Output:

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e.
"not a number") values:

pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])


Output:

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0

● From a dictionary of Series objects

A DataFrame can be constructed from a dictionary of Series objects as well:

pd.DataFrame({'population': population, 'area': area})


Output:

● From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any


specified column and index names. If omitted, an integer index will be used for
each:
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'],index=['a', 'b', 'c'])

Output:

● From a NumPy structured array

A Pandas DataFrame operates much like a structured array and can be created
directly from one:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
Output:
array([(0, 0.0), (0, 0.0), (0, 0.0)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
Output:

11. The Pandas Index Object

Consider an Index from a list of integers:

ind = pd.Index([2, 3, 5, 7, 11])


ind
Output:
Int64Index([2, 3, 5, 7, 11], dtype='int64')

11.1 Index as immutable array

The Index in many ways operates like an array. For example, we can use standard
Python indexing notation to retrieve values or slices:
ind[1]

Output:

3
ind[::2]

Output:

Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

print(ind.size, ind.shape, ind.ndim, ind.dtype)

Output:

5 (5,) 1 int64

One difference between Index objects and NumPy arrays is that indices are
immutable–that is, they cannot be modified via the normal means:

ind[1] = 0

Output:

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple DataFrames and
arrays, without the potential for side effects from inadvertent index modification.
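
As a small, hedged illustration of this point (not from the original text), two Series can safely share a single Index object:

import pandas as pd

idx = pd.Index(['a', 'b', 'c'])          # one Index shared by two objects
s1 = pd.Series([1, 2, 3], index=idx)
s2 = pd.Series([10, 20, 30], index=idx)

# idx[0] = 'z'   # would raise TypeError: Index does not support mutable operations
print(s1 + s2)   # arithmetic aligns on the shared index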

11.2 Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets,
which depend on many aspects of set arithmetic. The Index object follows many of
the conventions used by Python's built-in set data structure, so that unions,
intersections, differences, and other combinations can be computed in a familiar
way:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Output:
Int64Index([3, 5, 7], dtype='int64')
indA | indB # union
Output:
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB # symmetric difference
Output:
Int64Index([1, 2, 9, 11], dtype='int64')
These operations may also be accessed via object methods, for example
indA.intersection(indB).

12.Data Indexing and Selection

12.1 Data Selection in Series

12.1.1 Series as dictionary


Like a dictionary, the Series object provides a mapping from a collection of keys to a
collection of values:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Output:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']

Output:

0.5

We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values:

'a' in data

Output:

True

data.keys()

Output:

Index(['a', 'b', 'c', 'd'], dtype='object')

list(data.items())

Output:

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Series objects can even be modified with a dictionary-like syntax. Just as we can extend
a dictionary by assigning to a new key, we can extend a Series by assigning to a new
index value:

data['e'] = 1.25

data

Output:

a 0.25

b 0.50

c 0.75

d 1.00

e 1.25

dtype: float64
12.1.2 Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection
via the same basic mechanisms as NumPy arrays: that is, slices, masking, and fancy
indexing. Examples of these are as follows:

# slicing by explicit index


data['a':'c']
Output:
a 0.25
b 0.50
c 0.75
dtype: float64
# slicing by implicit integer index
data[0:2]
Output:

a 0.25

b 0.50

dtype: float64

# masking
data[(data > 0.3) & (data < 0.8)]
Output:

b 0.50

c 0.75

dtype: float64

# fancy indexing
data[['a', 'e']]
Output:
a 0.25
e 1.25
dtype: float64
When slicing with an explicit index (i.e., data['a':'c']), the final index is included in the
slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is
excluded from the slice.

12.1.3 Indexers: loc, iloc, and ix

If Series has an explicit integer index, an indexing operation such as data[1] will use
the explicit indices, while a slicing operation like data[1:3] will use the implicit
Python-style index.

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

data

Output:
1 a
3 b
5 c
dtype: object
# explicit index when indexing

data[1]
Output:
'a'
# implicit index when slicing

data[1:3]
Output:
3 b
5 c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing
interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit
index:

data.loc[1]
Output:

'a'

data.loc[1:3]
Output:

1 a

3 b

dtype: object

The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:

data.iloc[1]
Output:

'b'

data.iloc[1:3]
Output:

3 b

5 c

dtype: object
12.2 Data Selection in DataFrame

12.2.1 DataFrame as a dictionary

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,


'Florida': 170312, 'Illinois': 149995})

pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,


'Florida': 19552860, 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})

data

Output:

The individual Series that make up the columns of the DataFrame can be accessed
via dictionary-style indexing of the column name:

data['area']
Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
We can use attribute-style access with column names that are strings:
data.area

Output:

California 423967

Florida 170312

Illinois 149995

New York 141297

Texas 695662

Name: area, dtype: int64

This attribute-style column access actually accesses the exact same object as the
dictionary-style access:

data.area is data['area']

Output:

True

If the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:

data.pop is data['pop']

Output:

False

Like with the Series objects, this dictionary-style syntax can also be used to modify
the object, in this case adding a new column:

data['density'] = data['pop'] / data['area']

data

Output:
12.2.2 DataFrame as two-dimensional array

We can view the DataFrame as an enhanced two-dimensional array. We can


examine the data array using the values attribute:

data.values

Output:

array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],


[ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
[ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
[ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
[ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
we can transpose the full DataFrame to swap rows and columns:
data.T
Output:
Passing a single index to an array accesses a row:

data.values[0]

Output:

array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

Passing a single "index" to a DataFrame accesses a column:

data['area']

Output:

California 423967

Florida 170312

Illinois 149995

New York 141297

Texas 695662

Name: area, dtype: int64

Using the iloc indexer, we can index the underlying array as if it is a simple NumPy
array (using the implicit Python-style index), but the DataFrame index and column
labels are maintained in the result:

data.iloc[:3, :2]

Output:

Similarly, using the loc indexer we can index the underlying data in an array-like
style but using the explicit index and column names:

data.loc[:'Illinois', :'pop']
Output:

The ix indexer allows a hybrid of these two approaches:

data.ix[:3, :'pop']

Output:

In the loc indexer we can combine masking and fancy indexing as in the following:

data.loc[data.density > 100, ['pop', 'density']]

Output:

data.iloc[0, 2] = 90

data

Output:
12.2.3 Additional indexing conventions

While indexing refers to columns, slicing refers to rows:

data['Florida':'Illinois']

Output:

Such slices can also refer to rows by number rather than by index:

data[1:3]

Output:

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

data[data.density > 100]

Output:

13. Handling Missing Data

13.1 Trade-Offs in Missing Data Conventions

There are two strategies to indicate the presence of missing data in a table or DataFrame.
That is using a mask that globally indicates missing values or choosing a sentinel value that
indicates a missing entry.

In the masking approach, the mask might be an entirely separate Boolean array, or it may
involve appropriation of one bit in the data representation to locally indicate the null status
of a value.
In the sentinel approach, the sentinel value could be some data-specific convention,
such as indicating a missing integer value with -9999 or some rare bit pattern, or it
could be a more global convention, such as indicating a missing floating-point value
with NaN (Not a Number), a special value which is part of the IEEE floating-point
specification.

None of these approaches is without trade-offs: use of a separate mask array


requires allocation of an additional Boolean array, which adds overhead in both
storage and computation. A sentinel value reduces the range of valid values that can
be represented, and may require extra (often non-optimized) logic in CPU and GPU
arithmetic. Common special values like NaN are not available for all data types.

As in most cases where no universally optimal choice exists, different languages and
systems use different conventions.
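
A minimal, hedged sketch contrasting the two strategies (illustrative values only; plain NumPy is used here just to make the ideas concrete):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# Masking approach: a separate Boolean array marks which entries are missing.
mask = np.array([False, True, False, False])
print(data[~mask].mean())     # aggregate over the valid entries only

# Sentinel approach: a special value (NaN) is stored in place of the missing data.
sentinel = data.copy()
sentinel[1] = np.nan
print(np.nanmean(sentinel))   # a NaN-aware aggregation gives the same result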

13.2 Missing Data in Pandas

Pandas chose to use sentinels for missing data and further chose to use two
already-existing Python null values: the special floating point NaN value and the
Python None object.

13.2.1 None: Pythonic missing data

The first sentinel value used by Pandas is None, a Python singleton object that is
often used for missing data in Python code. Because it is a Python object, None
cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data
type 'object' (i.e., arrays of Python objects):

import numpy as np

import pandas as pd

vals1 = np.array([1, None, 3, 4])

vals1

Output:

array([1, None, 3, 4], dtype=object)


This dtype=object means that the best common type representation NumPy could
infer for the contents of the array is that they are Python objects.

for dtype in ['object', 'int']:

print("dtype =", dtype)

%timeit np.arange(1E6, dtype=dtype).sum()

print()

Output:

dtype = object

10 loops, best of 3: 78.2 ms per loop

dtype = int

100 loops, best of 3: 3.06 ms per loop

The use of Python objects in an array also means that if we perform aggregations
like sum() or min() across an array with a None value, we will generally get an error:

vals1.sum()

Output:

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

The addition between an integer and None is undefined.

13.2.2 NaN: Missing numerical data

NaN (acronym for Not a Number) is a special floating-point value recognized by all
systems that use the standard IEEE floating-point representation:

vals2 = np.array([1, np.nan, 3, 4])

vals2.dtype
Output:
dtype('float64')
The result of arithmetic with NaN will be another NaN:
1 + np.nan
Output:
nan
0 * np.nan
Output:
nan
vals2.sum(), vals2.min(), vals2.max()
Output:
(nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Output:
(8.0, 1.0, 4.0)
NaN is specifically a floating-point value. There is no equivalent NaN value for
integers, strings, or other types.
13.2.3 NaN and None in Pandas
pd.Series([1, np.nan, 2, None])
Output:
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For types that don't have an available sentinel value, Pandas automatically
type-casts when NA values are present. For example, if we set a value in an integer
array to np.nan, it will automatically be upcast to a floating-point type to
accommodate the NA:

x = pd.Series(range(2), dtype=int)

x

Output:

0 0

1 1

dtype: int64

x[0] = None

x

Output:

0 NaN

1 1.0

dtype: float64

In addition to casting the integer array to floating point, Pandas automatically


converts the None to a NaN value.

The following table lists the upcasting conventions in Pandas when NA values are introduced:

Typeclass     Conversion when storing NAs     NA sentinel value
floating      No change                       np.nan
object        No change                       None or np.nan
integer       Cast to float64                 np.nan
boolean       Cast to object                  None or np.nan

In Pandas, string data is always stored with an object dtype.

13.3 Operating on Null Values

Pandas treats None and NaN as essentially interchangeable for indicating missing or
null values. To facilitate this convention, there are several useful methods for
detecting, removing, and replacing null values in Pandas data structures. They are:

● isnull(): Generate a boolean mask indicating missing values


● notnull(): Opposite of isnull()
● dropna(): Return a filtered version of the data
● fillna(): Return a copy of the data with missing values filled or imputed

13.3.1 Detecting null values

Pandas data structures have two useful methods for detecting null data: isnull() and
notnull(). Either one will return a Boolean mask over the data. For example:

data = pd.Series([1, np.nan, 'hello', None])

data.isnull()

Output:

0 False

1 True

2 False

3 True

dtype: bool

Boolean masks can be used directly as a Series or DataFrame index:

data[data.notnull()]

Output:

0 1

2 hello

dtype: object

The isnull() and notnull() methods produce similar Boolean results for DataFrames.
13.3.2 Dropping null values

In addition to the masking used before, there are the convenience methods,
dropna() (which removes NA values) and fillna() (which fills in NA values). For a
Series, the result is straightforward:

data.dropna()
Output:
0 1
2 hello
dtype: object
Consider the following DataFrame:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
Output:

We cannot drop single values from a DataFrame; we can only drop full rows or full
columns.

By default, dropna() will drop all rows in which any null value is present:

df.dropna()

Output:

Alternatively, we can drop NA values along a different axis; axis=1 drops all columns
containing a null value:

df.dropna(axis='columns')
Output:

The default is how='any', such that any row or column (depending on the axis
keyword) containing a null value will be dropped. We can also specify how='all',
which will only drop rows/columns that are all null values:

df[3] = np.nan

df

Output:

df.dropna(axis='columns', how='all')

Output:

The thresh parameter lets us specify a minimum number of non-null values for the row/column to be kept:

df.dropna(axis='rows', thresh=3)

Output:

Here the first and last row have been dropped, because they contain only two
non-null values.
13.3.3 Filling null values

Pandas provides the fillna() method, which returns a copy of the array with the null
values replaced.

Consider the following Series:

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))


data
Output:
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
We can fill NA entries with a single value, such as zero:

data.fillna(0)
Output:
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
We can specify a forward-fill to propagate the previous value forward: # forward-fill

data.fillna(method='ffill')
Output:
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
Or we can specify a back-fill to propagate the next values backward:

# back-fill

data.fillna(method='bfill')

Output:
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64

For DataFrames, the options are similar, but we can also specify an axis along which
the fills take place:

df

Output:

df.fillna(method='ffill', axis=1)

Output:

Note that if a previous value is not available during a forward fill, the NA value remains.
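
A brief, hedged sketch of that rule (illustrative values only):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])
print(s.fillna(method='ffill'))
# 0    NaN   <- the leading NaN has no previous value, so it remains
# 1    2.0
# 2    2.0
# 3    4.0
# dtype: float64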

14. Hierarchical Indexing

We begin with the standard imports:

import pandas as pd

import numpy as np
14.1 A Multiply Indexed Series

14.1.1 Pandas MultiIndex

Suppose we have a Series of state populations indexed by (state, year) tuples:

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)

We can create a multi-index from these tuples as follows:

index = pd.MultiIndex.from_tuples(index)

index

Output:

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],

labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

The MultiIndex contains multiple levels of indexing–in this case, the state names and
the years, as well as multiple labels for each data point which encode these levels.

If we re-index our series with this MultiIndex, we see the hierarchical representation
of the data:

pop = pop.reindex(index)

pop

Output:

California 2000 33871648

2010 37253956

New York 2000 18976457

2010 19378102

Texas 2000 20851820

2010 25145561

dtype: int64

Here the first two columns of the Series representation show the multiple index
values, while the third column shows the data. Some entries are missing in the first
column: in this multi-index representation, any blank entry indicates the same value
as the line above it.
Now to access all data for which the second index is 2010, we can simply use the
Pandas slicing notation:
pop[:, 2010]
Output:
California 37253956
New York 19378102
Texas 25145561
dtype: int64
The result is a singly indexed array with just the keys we're interested in.
14.1.2 MultiIndex as extra dimension
The unstack() method will quickly convert a multiply indexed Series into a
conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
Output:

The stack() method provides the opposite operation:

pop_df.stack()

Output:

California 2000 33871648


2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
• We were able to use multi-indexing to represent two-dimensional data within
a one-dimensional Series.
• We can also use it to represent data of three or more dimensions in a Series
or DataFrame.
• Each extra level in a multi-index represents an extra dimension of data; taking
advantage of this property gives us much more flexibility in the types of data
we can represent.
• Concretely, we might want to add another column of demographic data for
each state at each year (say, population under 18) ; with a MultiIndex this is
as easy as adding another column to the DataFrame:
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
Output:

14.2 Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame is


to simply pass a list of two or more index arrays to the constructor. For example:

df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
Output:
The work of creating the MultiIndex is done in the background.

Similarly, if we pass a dictionary with appropriate tuples as keys, Pandas will


automatically recognize this and use a MultiIndex by default:

data = {('California', 2000): 33871648,

('California', 2010): 37253956,

('Texas', 2000): 20851820,

('Texas', 2010): 25145561,

('New York', 2000): 18976457,

('New York', 2010): 19378102}

pd.Series(data)

Output:

California 2000 33871648

2010 37253956

New York 2000 18976457

2010 19378102

Texas 2000 20851820

2010 25145561

dtype: int64
14.2.1 Explicit MultiIndex constructors

We can construct the MultiIndex from a simple list of arrays giving the index values
within each level:

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

Output:

MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

We can construct it from a list of tuples giving the multiple index values of each
point:

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

Output:

MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

We can even construct it from a Cartesian product of single indices:

pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

Output:

MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Similarly, we can construct the MultiIndex directly using its internal encoding by
passing levels (a list of lists containing available index values for each level) and
labels (a list of lists that reference these labels):

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])


Output:

MultiIndex(levels=[['a', 'b'], [1, 2]],

labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Any of these objects can be passed as the index argument when creating a Series or
Dataframe, or be passed to the reindex method of an existing Series or DataFrame.
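
For instance, here is a small, hedged sketch (illustrative values) that uses a MultiIndex built with from_product as the index of a new Series:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['letter', 'number'])
s = pd.Series(np.arange(4.0), index=idx)   # values 0.0, 1.0, 2.0, 3.0

print(s)
print(s['b'])   # partial indexing on the first level returns a smaller Series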

14.2.2 MultiIndex level names

Sometimes it is convenient to name the levels of the MultiIndex. This can be


accomplished by passing the names argument to any of the above MultiIndex
constructors, or by setting the names attribute of the index after the fact:

pop.index.names = ['state', 'year']


pop
Output:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
With more involved datasets, this can be a useful way to keep track of the meaning
of various index values.

14.2.3 MultiIndex for columns

In a DataFrame, the rows and columns are completely symmetric, and just as the
rows can have multiple levels of indices, the columns can have multiple levels as
well.
Consider the following medical data:

# hierarchical indices and columns

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],

names=['year', 'visit'])

columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],

names=['subject', 'type'])

# mock some data

data = np.round(np.random.randn(4, 6), 1)

data[:, ::2] *= 10

data += 37

# create the DataFrame

health_data = pd.DataFrame(data, index=index, columns=columns)

health_data

Output:

we can index the top-level column by the person's name and get a full DataFrame
containing just that person's information:

health_data['Guido']

Output:
For complicated records containing multiple labeled measurements across
multiple times for many subjects (people, countries, cities, etc.) use of
hierarchical rows and columns can be convenient.

14.3 Indexing and Slicing a MultiIndex

14.3.1 Multiply indexed Series

Consider the multiply indexed Series of state populations

pop
Output:
state year
California 2000 33871648
2010 37253956

New York 2000 18976457


2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
We can access single elements by indexing with multiple terms:

pop['California', 2000]

Output:

33871648
The MultiIndex also supports partial indexing, or indexing just one of the levels in
the index. The result is another Series, with the lower-level indices maintained:

pop['California']
Output:
year
2000 33871648
2010 37253956
dtype: int64
pop.loc['California':'New York']
Output:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
With sorted indices, partial indexing can be performed on lower levels by passing an
empty slice in the first index:

pop[:, 2000]

Output:

state

California 33871648

New York 18976457

Texas 20851820

dtype: int64
Selection based on Boolean masks also works:

pop[pop > 22000000]


Output:
state year
California 2000 33871648
2010 37253956
Texas 2010 25145561
dtype: int64
Selection based on fancy indexing also works:

pop[['California', 'Texas']]
Output:
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype: int64
14.3.2 Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider the toy
medical DataFrame
health_data
Output:
Columns are primary in a DataFrame, and the syntax used for multiply indexed
Series applies to the columns. For example, we can recover Guido's heart rate data
with a simple operation:
health_data['Guido', 'HR']
Output:
year visit
2013 1 32.0
2 50.0
2014 1 39.0
2 48.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the loc, iloc, and ix indexers
health_data.iloc[:2, :2]
Output:

These indexers provide an array-like view of the underlying two-dimensional data,
but each individual index in loc or iloc can be passed a tuple of multiple indices. For
example:

health_data.loc[:, ('Bob', 'HR')]


Output:
year visit
2013 1 31.0
2 44.0
2014 1 30.0
2 47.0
Name: (Bob, HR), dtype: float64
Trying to create a slice within a tuple will lead to a syntax error:

health_data.loc[(:, 1), (:, 'HR')]

Output:

SyntaxError: invalid syntax

Pandas provides IndexSlice object to overcome the above condition:

idx = pd.IndexSlice

health_data.loc[idx[:, 1], idx[:, 'HR']]

Output:

14.4 Rearranging Multi-Indices

14.4.1 Sorted and unsorted indices

Many of the MultiIndex slicing operations will fail if the index is not sorted.

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])

data = pd.Series(np.random.rand(6), index=index)

data.index.names = ['char', 'int']

data

Output:

char int
a 1 0.003001
2 0.164974
c 1 0.741650
2 0.569264
b 1 0.001693
2 0.526226
dtype: float64

If we try to take a partial slice of this index, it will result in an error:

try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)
Output:
<class 'KeyError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'

Partial slices and other similar operations require the levels in the MultiIndex to be in
sorted (i.e., lexicographical) order. Pandas provides a number of convenience routines
to perform this type of sorting; examples are the sort_index() and sortlevel()
methods of the DataFrame

data = data.sort_index()

data

Output:

char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
c 1 0.741650
2 0.569264
dtype: float64
With the index sorted in this way, partial slicing will work as expected:

data['a':'b']

Output:

char int

a 1 0.003001

2 0.164974

b 1 0.001693

2 0.526226

dtype: float64

14.4.2 Stacking and unstacking indices


It is possible to convert a dataset from a stacked multi-index to a simple
two-dimensional representation, optionally specifying the level to use:
pop.unstack(level=0)
Output:

pop.unstack(level=1)
Output:
The opposite of unstack() is stack(), which here can be used to recover the original
series:
pop.unstack().stack()
Output:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64

14.4.3 Index setting and resetting


Another way to rearrange hierarchical data is to turn the index labels into columns;
this can be accomplished with the reset_index method. Calling this on the
population dictionary will result in a DataFrame with a state and year column
holding the information that was formerly in the index.
we can optionally specify the name of the data for the column representation:
pop_flat = pop.reset_index(name='population')
pop_flat

Output:
The set_index method of the DataFrame returns a multiply indexed DataFrame:
pop_flat.set_index(['state', 'year'])
Output:

14.5 Data Aggregations on Multi-Indices

Pandas has built-in data aggregation methods such as mean(), sum(), and max(). For
hierarchically indexed data, these can be passed a level parameter that controls which
subset of the data the aggregate is computed on.

For example, let's consider the health data:

health_data
Output:

data_mean = health_data.mean(level='year')
data_mean
Output:

Using the axis keyword, we can take the mean among levels on the columns as well:

data_mean.mean(axis=1, level='type')

Output:
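Note that in recent versions of Pandas the level keyword of aggregation methods such as mean() has been deprecated in favour of an explicit groupby; a hedged equivalent sketch of the two computations above is:

# Group the rows by the 'year' index level, then average
data_mean = health_data.groupby(level='year').mean()

# Average across columns grouped by the 'type' column level
# (transpose, group on the row level, average, transpose back)
data_mean.T.groupby(level='type').mean().T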

15.Combining Datasets: Concat and Append

Concatenation of Series and DataFrame objects is very similar to concatenation of
NumPy arrays, which can be done via the np.concatenate function. With this function,
we can combine the contents of two or more arrays into a single array:

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
Output:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
The first argument is a list or tuple of arrays to concatenate. Additionally, it takes an
axis keyword that allows us to specify the axis along which the result will be
concatenated:

x = [[1, 2], [3, 4]]

np.concatenate([x, x], axis=1)


Output:

array([[1, 2, 1, 2],

[3, 4, 3, 4]])

15.1 Simple Concatenation with pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but
contains a number of options.

# Signature in Pandas v0.18

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,

keys=None, levels=None, names=None, verify_integrity=False,

copy=True)

pd.concat() can be used for a simple concatenation of Series or DataFrame objects,
just as np.concatenate() can be used for simple concatenations of arrays:

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])


ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
Output:
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
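The DataFrame examples that follow rely on two small helper utilities, make_df and display, which are not defined in this section. A minimal sketch of them (assuming everything lives in the same interactive session) is:

import pandas as pd

def make_df(cols, ind):
    """Quickly build a DataFrame whose cells combine the column letter and row label."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

def display(*expressions):
    """Simplified stand-in for the notebook helper: print each expression and its value."""
    for expr in expressions:
        print(expr)
        print(eval(expr), end='\n\n')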
It also works to concatenate higher-dimensional objects, such as DataFrames:

df1 = make_df('AB', [1, 2])


df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
Output:

By default, the concatenation takes place row-wise within the DataFrame (i.e.,
axis=0). Like np.concatenate, pd.concat allows specification of an axis along which
concatenation will take place. Consider the following example:

df3 = make_df('AB', [0, 1])


df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")
Output:

We could have equivalently specified axis=1; here we've used the more intuitive axis='columns'.
15.1.1 Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas
concatenation preserves indices, even if the result will have duplicate indices.
Consider this example:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
Output:

15.1.2 Catching the repeats as an error

To verify that the indices in the result of pd.concat() do not overlap, we can specify
the verify_integrity flag. With this set to True, the concatenation will raise an
exception if there are duplicate indices. Here is an example, where for clarity we'll
catch and print the error message:

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
Output:
ValueError: Indexes have overlapping values: [0, 1]

15.1.3 Ignoring the index

Sometimes the index itself does not matter and we would prefer it to simply be
ignored. This option can be specified using the ignore_index flag. With this set to
true, the concatenation will create a new integer index for the resulting Series:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

Output:

15.1.4 Adding MultiIndex keys

Another option is to use the keys option to specify a label for the data sources; the
result will be a hierarchically indexed series containing the data:

display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

Output:

The result is a multiply indexed DataFrame

15.1.5 Concatenation with joins

We were mainly concatenating DataFrames with shared column names. In practice,
data from different sources might have different sets of column names, and
pd.concat offers several options in this case. Consider the concatenation of the
following two DataFrames, which have some but not all columns in common:
df5 = make_df('ABC', [1, 2])

df6 = make_df('BCD', [3, 4])

display('df5', 'df6', 'pd.concat([df5, df6])')

Output:

By default, the entries for which no data is available are filled with NA values. To
change this, we can specify one of several options for the join and join_axes
parameters of the concatenate function. By default, the join is a union of the input
columns (join='outer'), but we can change this to an intersection of the columns
using join='inner':

display('df5', 'df6',

"pd.concat([df5, df6], join='inner')")

Output:

Another option is to directly specify the index of the remaining columns using the
join_axes argument, which takes a list of index objects. Here we will specify that the
returned columns should be the same as those of the first input:
display('df5', 'df6',
"pd.concat([df5, df6], join_axes=[df5.columns])")
Output:

15.1.6 The append() method

Because direct array concatenation is so common, Series and DataFrame objects
have an append method that can accomplish the same thing in fewer keystrokes. For
example, rather than calling pd.concat([df1, df2]), we can simply call
df1.append(df2):
display('df1', 'df2', 'df1.append(df2)')
Output:

Unlike the append() and extend() methods of Python lists, the append() method in
Pandas does not modify the original object, instead it creates a new object with the
combined data. It also is not a very efficient method, because it involves creation of
a new index and data buffer. Thus, if we want to do multiple append operations, it is
better to build a list of DataFrames and pass them all at once to the concat()
function.
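As a hedged illustration of this recommendation (the frames here are arbitrary make_df outputs):

# Collect the pieces in a list first, then concatenate once
frames = [make_df('AB', [2 * i, 2 * i + 1]) for i in range(3)]
combined = pd.concat(frames, ignore_index=True)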
16. Combining Datasets: Merge and Join

16.1 Categories of Joins

The pd.merge() function implements a number of types of joins: the one-to-one,
many-to-one, and many-to-many joins. All three types of joins are accessed via an
identical call to the pd.merge() interface; the type of join performed depends on the
form of the input data.

16.1.1 One-to-one joins

Consider the following two DataFrames, which contain information on several
employees in a company:

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],

'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],

'hire_date': [2004, 2008, 2012, 2014]})

display('df1', 'df2')

Output:

To combine this information into a single DataFrame, we can use the pd.merge()
function:

df3 = pd.merge(df1, df2)
df3
Output:

The pd.merge() function recognizes that each DataFrame has an "employee"
column, and automatically joins using this column as a key. The result of the merge
is a new DataFrame that combines the information from the two inputs.

The order of entries in each column is not necessarily maintained: in this case, the
order of the "employee" column differs between df1 and df2 and the pd.merge()
function correctly accounts for this.

16.1.2 Many-to-one joins

Many-to-one joins are joins in which one of the two key columns contains duplicate
entries. For the many-to-one case, the resulting DataFrame will preserve those
duplicate entries as appropriate. Consider the following example of a many-to-one
join:

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],

'supervisor': ['Carly', 'Guido', 'Steve']})

display('df3', 'df4', 'pd.merge(df3, df4)')

Output:
The resulting DataFrame has an additional column with the "supervisor" information,
where the information is repeated in one or more locations as required by the inputs.

16.1.3 Many-to-many joins

If the key column in both the left and right array contains duplicates, then the result
is a many-to-many merge.

Consider the following, where we have a DataFrame showing one or more skills
associated with a particular group. By performing a many-to-many join, we can
recover the skills associated with any individual person:

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',

'Engineering', 'Engineering', 'HR', 'HR'],

'skills': ['math', 'spreadsheets', 'coding', 'linux',

'spreadsheets', 'organization']})

display('df1', 'df5', "pd.merge(df1, df5)")

Output:
16.2 Specification of the Merge Key

pd.merge() looks for one or more matching column names between the two inputs
and uses this as the key. However, often the column names will not match and
pd.merge() provides a variety of options for handling this.

16.2.1 The on keyword

We can explicitly specify the name of the key column using the on keyword, which
takes a column name or a list of column names:

display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

Output:

This option works only if both the left and right DataFrames have the specified
column name.

16.2.2 The left_on and right_on keywords

At times we may want to merge two datasets with different column names. For
example, we may have a dataset in which the employee name is labeled as "name"
rather than "employee". In this case, we can use the left_on and right_on keywords
to specify the two column names:

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
The result has a redundant column that we can drop. For example, by using the
drop() method of DataFrames:

pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)

Output:

16.2.3 The left_index and right_index keywords

Sometimes, rather than merging on a column, we would like to merge on an index.


For example,
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
Output:

We can use the index as the key for merging by specifying the left_index and/or
right_index flags in pd.merge():
display('df1a', 'df2a',
"pd.merge(df1a, df2a, left_index=True, right_index=True)")
Output:
If we want to mix indices and columns, we can combine left_index with right_on or
left_on with right_index to get the desired behavior:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
Output:

16.3 Specifying Set Arithmetic for Joins

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
Output:
Here we have merged two datasets that have only a single "name" entry in common:
Mary.
By default, the result contains the intersection of the two sets of inputs. This is what
is known as an inner join. We can specify this explicitly using the how keyword,
which defaults to "inner":
pd.merge(df6, df7, how='inner')
Output:

Other options for the how keyword are 'outer', 'left', and 'right'. An outer join returns
a join over the union of the input columns, and fills in all missing values with NAs:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
Output:

The left join and right join return joins over the left entries and right entries,
respectively. For example:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
Output:
The output rows correspond to the entries in the left input. Using how='right' works
in a similar manner.

16.4 Overlapping Column Names: The suffixes Keyword

Here the two input DataFrames have conflicting column names. Consider this
example:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
Output:

Because the output would have two conflicting column names, the merge function
automatically appends a suffix _x or _y to make the output columns unique. If these
defaults are inappropriate, it is possible to specify a custom suffix using the suffixes
keyword:
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')
Output:

These suffixes work in any of the possible join patterns and work also if there are
multiple overlapping columns.
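For example, a hedged sketch with two overlapping columns (the data below is invented):

df10 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2], 'score': [88, 92]})
df11 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [2, 1], 'score': [75, 95]})
# Both 'rank' and 'score' receive the _L / _R suffixes in the merged result
pd.merge(df10, df11, on='name', suffixes=['_L', '_R'])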

17. Aggregation and Grouping

Planets Data

Here we will use the Planets dataset, available via the Seaborn package. It gives
information on planets that astronomers have discovered around other stars.
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Output:
(1035, 6)

planets.head()
Output:

This has some details on the 1,000+ extrasolar planets discovered up to 2014.
17.1 Simple Aggregation in Pandas
For a Pandas Series the aggregates return a single value:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
Output:
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64

ser.sum()
Output:
2.8119254917081569

ser.mean()
Output:
0.56238509834163142

For a DataFrame, by default the aggregates return results within each column:
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
Output:

df.mean()
Output:
A 0.477888
B 0.443420
dtype: float64

By specifying the axis argument, we can instead aggregate within each row:
df.mean(axis='columns')
Output:
0 0.088290
1 0.513997
2 0.849309
3 0.406727
4 0.444949
dtype: float64
The method describe() computes several common aggregates for each column and
returns the result. We can use this on the Planets data for dropping rows with
missing values:
planets.dropna().describe()
Output:

Other built-in Pandas aggregations include count(), first(), last(), mean(), median(),
min(), max(), std(), var(), mad(), prod(), and sum(). These are all methods of
DataFrame and Series objects.

The groupby operation allows us to quickly and efficiently compute aggregates on
subsets of data.

17.2 GroupBy: Split, Apply, Combine

Split, apply, combine


A canonical example of this split-apply-combine operation is one in which the
"apply" step is a summation aggregation. The groupby operation accomplishes the following:


● The split step involves breaking up and grouping a DataFrame depending on
the value of the specified key.
● The apply step involves computing some function, usually an aggregate,
transformation, or filtering, within the individual groups.
● The combine step merges the results of these operations into an output array.
We can use Pandas to carry out this computation. We'll start by creating the input DataFrame:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
Output:

The most basic split-apply-combine operation can be computed with the groupby()
method of DataFrames, passing the name of the desired key column:
df.groupby('key')
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x117272160>
what is returned is not a set of DataFrames, but a DataFrameGroupBy object.

To produce a result, we can apply an aggregate to this DataFrameGroupBy object,


which will perform the appropriate apply/combine steps to produce the desired
result:
df.groupby('key').sum()
Output:
17.2.1 The GroupBy object
The GroupBy object is a very flexible abstraction and the most important
operations made available by a GroupBy are aggregate, filter, transform, and
apply.

● Column indexing

The GroupBy object supports column indexing in the same way as the DataFrame
and returns a modified GroupBy object. For example:
planets.groupby('method')
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
planets.groupby('method')['orbital_period']
Output:
<pandas.core.groupby.SeriesGroupBy object at 0x117272da0>
Here we have selected a particular Series group from the original DataFrame group
by reference to its column name. As with the GroupBy object, no computation is
done until we call some aggregate on the object:
planets.groupby('method')['orbital_period'].median()
Output:
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
Name: orbital_period, dtype: float64
This gives an idea of the general scale of orbital periods (in days).
● Iteration over groups

The GroupBy object supports direct iteration over the groups, returning each group
as a Series or DataFrame:
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
Output:
Astrometry shape=(2, 6)
Eclipse Timing Variations shape=(9, 6)
Imaging shape=(38, 6)
Microlensing shape=(23, 6)
Orbital Brightness Modulation shape=(3, 6)
Pulsar Timing shape=(5, 6)
Pulsation Timing Variations shape=(1, 6)
Radial Velocity shape=(553, 6)
Transit shape=(397, 6)
Transit Timing Variations shape=(4, 6)

● Dispatch methods

Through some Python class magic, any method not explicitly implemented by the
GroupBy object will be passed through and called on the groups, whether they are
DataFrame or Series objects. For example, we can use the describe() method of
DataFrames to perform a set of aggregations that describe each group in the data:

planets.groupby('method')['year'].describe().unstack()

Output:
17.2.2 Aggregate, filter, transform, apply
GroupBy objects have aggregate(), filter(), transform(), and apply() methods that
efficiently implement a variety of useful operations before combining the grouped
data.
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
Output:

● Aggregation

aggregate() method can take a string, a function, or a list and compute all the
aggregates at once.
df.groupby('key').aggregate(['min', np.median, max])
Output:
Another useful pattern is to pass a dictionary mapping column names to operations
to be applied on that column:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
Output:

● Filtering
A filtering operation allows us to drop data based on the group properties. For
example, we might want to keep all groups in which the standard deviation is larger
than some critical value:
def filter_func(x):
    return x['data2'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")


Output:

The filter function should return a Boolean value specifying whether the group
passes the filtering. Here because group A does not have a standard deviation
greater than 4, it is dropped from the result.
● Transformation
While aggregation must return a reduced version of the data, transformation can
return some transformed version of the full data to recombine. For such a
transformation, the output is the same shape as the input. A common example is to
center the data by subtracting the group-wise mean:

df.groupby('key').transform(lambda x: x - x.mean())
Output:

● The apply() method


The apply() method lets us apply an arbitrary function to the group results. The
function should take a DataFrame, and return either a Pandas object (e.g.,
DataFrame, Series) or a scalar; the combine operation will be tailored to the type of
output returned.
For example, here is an apply() that normalizes the first column by the sum of the
second:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

display('df', "df.groupby('key').apply(norm_by_data2)")

Output:
17.2.3 Specifying the split key
● A list, array, series, or index providing the grouping keys
The key can be any series or list with a length matching that of the DataFrame. For
example:
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')
Output:

There is another, more verbose way of accomplishing the same df.groupby('key') operation: pass the key series itself.

display('df', "df.groupby(df['key']).sum()")
Output:
● A dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys:

df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')
Output:

● Any Python function


Similar to mapping, we can pass any Python function that will input the index value
and output the group:
display('df2', 'df2.groupby(str.lower).mean()')
Output:
● A list of valid keys
Any of the preceding key choices can be combined to group on a multi-index:
df2.groupby([str.lower, mapping]).mean()
Output:

17.2.4 Grouping example


In a couple lines of Python code we can put all these together and count discovered
planets by method and by decade:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
Output:
This shows the power of combining many of the operations.

18. Pivot Tables


• A pivot table is a similar operation that is commonly seen in spreadsheets and
other programs that operate on tabular data.
• The pivot table takes simple column-wise data as input, and groups the
entries into a two-dimensional table that provides a multidimensional
summarization of the data.

18.1 Motivating Pivot Tables


• We'll use the database of passengers on the Titanic, available through the
Seaborn library
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

This contains a wealth of information on each passenger of that ill-fated voyage,


including gender, age, class, fare paid, and much more.

18.2 Pivot Tables by Hand


• To start learning more about this data, we might begin by grouping according
to gender, survival status, or some combination thereof.
titanic.groupby('sex')[['survived']].mean()

• This immediately gives us some insight: overall, three of every four females
on board survived, while only one in five males survived!
• Using the vocabulary of GroupBy, we might proceed using something like this:
• we group by class and gender, select survival, apply a mean aggregate,
combine the resulting groups, and then unstack the hierarchical index to
reveal the hidden multidimensionality. In code:

titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
• This gives us a better idea of how both gender and class affected survival.
• This two-dimensional GroupBy is common enough that Pandas includes a
convenience routine, pivot_table, which succinctly handles this type of
multi-dimensional aggregation.

18.3 Pivot Table Syntax

Here is the equivalent to the preceding operation using the pivot_table method of
DataFrames:
titanic.pivot_table('survived', index='sex', columns='class')

This is eminently more readable than the groupby approach, and produces the same
result.
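pivot_table also accepts options such as aggfunc (the aggregation to apply, 'mean' by default), fill_value, and margins; a hedged sketch:

# Count the survivors per group instead of averaging, and add row/column totals
titanic.pivot_table('survived', index='sex', columns='class',
                    aggfunc='sum', margins=True)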

18.4 Multi-level pivot tables


• Just as in the GroupBy, the grouping in pivot tables can be specified with
multiple levels, and via a number of options.
• For example, we might be interested in looking at age as a third dimension.
• We'll bin the age using the pd.cut function:
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
We can apply the same strategy when working with the columns as well; let's add
info on the fare paid using pd.qcut to automatically compute quantiles:

fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])

The result is a four-dimensional aggregation with hierarchical indices, shown in a grid
demonstrating the relationship between the values.

Example: Birthrate Data

Let's take a look at the freely available data on births in the United States, provided
by the Centers for Disease Control (CDC). This data can be found at
https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv

# shell command to download the data:

# !curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('data/births.csv')

Taking a look at the data, we see that it's relatively simple–it contains the number of
births grouped by date and gender:
births.head()

We can start to understand this data a bit more by using a pivot table. Let's add a
decade column, and take a look at male and female births as a function of decade:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')

We immediately see that male births outnumber female births in every decade. To
see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to
visualize the total number of births by year
%matplotlib inline

import matplotlib.pyplot as plt

sns.set() # use Seaborn styles

births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()


plt.ylabel('total births per year');

With a simple pivot table and plot() method, we can immediately see the annual
trend in births by gender. By eye, it appears that over the past 50 years male births
have outnumbered female births by around 5%.

19. Vectorized String Operations

19.1 Introducing Pandas String Operations


Pandas includes features to address the need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index
objects containing strings. So, for example, suppose we create a Pandas Series with
this data:
import pandas as pd
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names
Output:
0 peter
1 Paul
2 None
3 MARY
4 gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping
over any missing values:
names.str.capitalize()
Output:
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object

19.2 Tables of Pandas String Methods


Consider the example using the following series of names:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])

19.2.1 Methods similar to Python string methods


Pandas provides str methods that mirror Python string methods, such as lower(), upper(),
len(), startswith(), and split(). These have various return values. Some, like lower(), return a series of strings:
monte.str.lower()
Output:
0 graham chapman
1 john cleese
2 terry gilliam
3 eric idle
4 terry jones
5 michael palin
dtype: object

But some others return numbers:


monte.str.len()
Output:
0 14
1 11
2 13
3 9
4 11
5 13
dtype: int64

Or Boolean values:
monte.str.startswith('T')
Output:
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool

Still others return lists or other compound values for each element:
monte.str.split()
Output:
0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
19.2.2 Methods using regular expressions

There are several methods that accept regular expressions to examine the content
of each string element and follow some of the API conventions of Python's built-in
re module, such as match(), extract(), findall(), replace(), contains(), count(), and split().

We can extract the first name from each element by asking for a contiguous group of
characters at the beginning of each element:
monte.str.extract('([A-Za-z]+)', expand=False)
Output:
0 Graham
1 John
2 Terry
3 Eric
4 Terry
5 Michael
dtype: object
We can find all names that start and end with a consonant, making use of the
start-of-string (^) and end-of-string ($) regular expression characters:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Output:
0 [Graham Chapman]
1 []
2 [Terry Gilliam]
3 []
4 [Terry Jones]
5 [Michael Palin]
dtype: object

The ability to concisely apply regular expressions across Series or DataFrame
entries opens up many possibilities for analysis and cleaning of data.
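For instance, a small hedged sketch of such cleaning steps (the patterns are arbitrary):

# Boolean filtering with a regular expression
monte[monte.str.contains(r'^T')]

# Normalise internal whitespace in every entry
monte.str.replace(r'\s+', ' ', regex=True)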
19.2.3 Miscellaneous methods
There are some miscellaneous methods that enable other convenient operations:

● Vectorized item access and slicing


The get() and slice() operations, in particular, enable vectorized element access
from each array. For example, we can get a slice of the first three characters of
each array using str.slice(0, 3).
monte.str[0:3]
Output:
0 Gra
1 Joh
2 Ter
3 Eri
4 Ter
5 Mic
dtype: object
Indexing via df.str.get(i) and df.str[i] is likewise similar.
These get() and slice() methods also let us access elements of arrays returned by
split(). For example, to extract the last name of each entry, we can combine split()
and get():
monte.str.split().str.get(-1)

Output:
0 Chapman
1 Cleese
2 Gilliam
3 Idle
4 Jones
5 Palin
dtype: object
● Indicator variables
The get_dummies() method is useful when our data has a column containing some
sort of coded indicator. For example, we might have a dataset that contains
information in the form of codes, such as A="born in America," B="born in the
United Kingdom," C="likes cheese," D="likes spam":

full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
Output:
The get_dummies() routine lets us split out these indicator variables into a
DataFrame:
full_monte['info'].str.get_dummies('|')
Output:

20. Working with Time Series

20.1 Dates and Times in Python

● Time stamps reference particular moments in time (e.g., July 4th, 2015 at
7:00am).
● Time intervals and periods reference a length of time between a particular
beginning and end point; for example, the year 2015. Periods usually
reference a special case of time intervals in which each interval is of uniform
length and does not overlap (e.g., 24 hour-long periods comprising days).
● Time deltas or durations reference an exact length of time (e.g., a duration
of 22.56 seconds).

While the time series tools provided by Pandas tend to be the most useful for data
science applications, it is helpful to see their relationship to other packages used in
Python.
● Native Python dates and times: datetime and dateutil

Python's basic objects for working with dates and times reside in the built-in
datetime module. Along with the third-party dateutil module, we can use it to quickly
perform a host of useful functionalities on dates and times. For example, we can
manually build a date using the datetime type:

from datetime import datetime

datetime(year=2015, month=7, day=4)

Output:

datetime.datetime(2015, 7, 4, 0, 0)

Or using the dateutil module, we can parse dates from a variety of string formats:

from dateutil import parser

date = parser.parse("4th of July, 2015")

date

Output:

datetime.datetime(2015, 7, 4, 0, 0)

Once we have a datetime object, we can print the day of the week:

date.strftime('%A')

Output:

'Saturday'

("%A") is one of the standard string format codes for printing dates.

● Typed arrays of times: NumPy's datetime64

The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of
dates to be represented very compactly. The datetime64 requires a very specific
input format:
import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date
Output:
array(datetime.date(2015, 7, 4), dtype='datetime64[D]')
• Once we have this date formatted, we can do vectorized operations on it:
date + np.arange(12)
Output:
array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
'2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
'2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'], dtype='datetime64[D]')

NumPy will infer the desired unit from the input; for example, here is a day-based
datetime:
np.datetime64('2015-07-04')
Output:
numpy.datetime64('2015-07-04')
Here is a minute-based datetime:
np.datetime64('2015-07-04 12:00')
Output:
numpy.datetime64('2015-07-04T12:00')

The time zone is automatically set to the local time on the computer executing the
code. We can force any desired fundamental unit using one of many format codes;
for example, here we'll force a nanosecond-based time:
np.datetime64('2015-07-04 12:59:59.50', 'ns')
Output:
numpy.datetime64('2015-07-04T12:59:59.500000000')
The available format codes run from coarse to fine units: Y (year), M (month), W (week),
D (day), h (hour), m (minute), s (second), ms (millisecond), us (microsecond), and
ns (nanosecond), with even finer codes (ps, fs, as) available.

For the types of data we see in the real world, a useful default is datetime64[ns], as
it can encode a useful range of modern dates with a suitably fine precision.

20.2 Dates and times in pandas

Pandas builds upon all the tools to provide a Timestamp object, which combines the
ease-of-use of datetime and dateutil with the efficient storage and vectorized
interface of numpy.datetime64. From a group of these Timestamp objects, Pandas
can construct a DatetimeIndex that can be used to index data in a Series or
DataFrame.

We can parse a flexibly formatted string date and use format codes to output the
day of the week:
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date
Output:
Timestamp('2015-07-04 00:00:00')
date.strftime('%A')
Output:
'Saturday'

Additionally, we can do NumPy-style vectorized operations directly on this same


object:

date + pd.to_timedelta(np.arange(12), 'D')


Output:
DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
'2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
'2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
dtype='datetime64[ns]', freq=None)

20.3 Pandas Time Series: Indexing by Time

With the Pandas time series tools, we can index data by timestamps. For example,
we can construct a Series object that has time indexed data:
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
'2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
Output:
2014-07-04 0
2014-08-04 1
2015-07-04 2
2015-08-04 3
dtype: int64

Now that we have this data in a Series, we can make use of any of the Series
indexing patterns.
data['2014-07-04':'2015-07-04']
Output:
2014-07-04 0
2014-08-04 1
2015-07-04 2
dtype: int64
There are additional special date-only indexing operations, such as passing a year
to obtain a slice of all data from that year:
data['2015']
Output:
2015-07-04 2
2015-08-04 3
dtype: int64

20.4 Pandas Time Series Data Structures

The fundamental Pandas data structures for working with time series data:
● For time stamps, Pandas provides the Timestamp type. It is essentially a
replacement for Python's native datetime, but is based on the more efficient
numpy.datetime64 data type. The associated Index structure is
DatetimeIndex.
● For time Periods, Pandas provides the Period type. This encodes a
fixed-frequency interval based on numpy.datetime64. The associated index
structure is PeriodIndex.
● For time deltas or durations, Pandas provides the Timedelta type. Timedelta is
a more efficient replacement for Python's native datetime.timedelta type and
is based on numpy.timedelta64. The associated index structure is
TimedeltaIndex.
The most fundamental of these date/time objects are the Timestamp and
DatetimeIndex objects.
While these class objects can be invoked directly, it is more common to use the
pd.to_datetime() function which can parse a wide variety of formats.
Passing a single date to pd.to_datetime() yields a Timestamp, passing a series of
dates by default yields a DatetimeIndex:
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
'2015-Jul-6', '07-07-2015', '20150708'])
dates

Output:

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',


'2015-07-08'],
dtype='datetime64[ns]', freq=None)
Any DatetimeIndex can be converted to a PeriodIndex with the to_period() function
with the addition of a frequency code. We will use 'D' to indicate daily frequency:
dates.to_period('D')
Output:
PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
'2015-07-08'],
dtype='int64', freq='D')

A TimedeltaIndex is created, for example, when a date is subtracted from another:


dates - dates[0]
Output:
TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'],
dtype='timedelta64[ns]', freq=None)
20.4.1 Regular sequences: pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a
few functions for this purpose: pd.date_range() for timestamps, pd.period_range()
for periods, and pd.timedelta_range() for time deltas.
We have seen that Python's range() and NumPy's np.arange() turn a startpoint,
endpoint, and optional stepsize into a sequence.
Similarly, pd.date_range() accepts a start date, an end date, and an optional
frequency code to create a regular sequence of dates. By default, the frequency is
one day:

pd.date_range('2015-07-03', '2015-07-10')
Output:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')

Alternatively, the date range can be specified not with a start and endpoint, but with
a startpoint and a number of periods:
pd.date_range('2015-07-03', periods=8)
Output:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
The spacing can be modified by altering the freq argument which defaults to D. For
example, here we will construct a range of hourly timestamps:
pd.date_range('2015-07-03', periods=8, freq='H')
Output:
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
'2015-07-03 02:00:00', '2015-07-03 03:00:00',
'2015-07-03 04:00:00', '2015-07-03 05:00:00',
'2015-07-03 06:00:00', '2015-07-03 07:00:00'],
dtype='datetime64[ns]', freq='H')

To create regular sequences of Period or Timedelta values, the very similar
pd.period_range() and pd.timedelta_range() functions are useful. Some monthly
periods are:

pd.period_range('2015-07', periods=8, freq='M')


Output:
PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
'2016-01', '2016-02'],
dtype='int64', freq='M')

And a sequence of durations increasing by an hour:


pd.timedelta_range(0, periods=10, freq='H')
Output:
TimedeltaIndex(['00:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00',
'05:00:00', '06:00:00', '07:00:00', '08:00:00', '09:00:00'],
dtype='timedelta64[ns]', freq='H')

20.5 Frequencies and Offsets

Fundamental to these Pandas time series tools is the concept of a frequency or date
offset. Just as with the D (day) and H (hour) codes, we can use such codes to specify any
desired frequency spacing. The main codes include D (calendar day), B (business day),
W (weekly), M (month end), Q (quarter end), A (year end), H (hours), T (minutes),
S (seconds), L (milliseconds), U (microseconds), and N (nanoseconds).
The monthly, quarterly, and annual frequencies are all marked at the end of the
specified period. By adding an S suffix to any of these (for example MS, QS, AS), they
instead will be marked at the beginning.

Additionally, we can change the month used to mark any quarterly or annual code
by adding a three-letter month code as a suffix:

● Q-JAN, BQ-FEB, QS-MAR, BQS-APR, etc.


● A-JAN, BA-FEB, AS-MAR, BAS-APR, etc.

In the same way, the split-point of the weekly frequency can be modified by adding
a three-letter weekday code:

● W-SUN, W-MON, W-TUE, W-WED, etc.

On top of this, codes can be combined with numbers to specify other frequencies.
For example, for a frequency of 2 hours 30 minutes, we can combine the hour (H)
and minute (T) codes as follows:
pd.timedelta_range(0, periods=9, freq="2H30T")
Output:
TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00',
'12:30:00', '15:00:00', '17:30:00', '20:00:00'],
dtype='timedelta64[ns]', freq='150T')
All of these short codes refer to specific instances of Pandas time series offsets,
which can be found in the pd.tseries.offsets module. For example, we can create a
business day offset directly as follows:
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
Output:
DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
'2015-07-07'],
dtype='datetime64[ns]', freq='B')

20.6 Resampling, Shifting, and Windowing

The ability to use dates and times as indices to intuitively organize and access data is
an important piece of the Pandas time series tools. The benefits of indexed data in
general (automatic alignment during operations, intuitive data slicing and access,
etc.) still apply and Pandas provides several additional time series-specific
operations.

For example, the accompanying pandas-datareader package (installable via conda
install pandas-datareader) knows how to import financial data from a number of
available sources, including Yahoo Finance, Google Finance, and others. Here we will
load Google's closing price history:

from pandas_datareader import data

goog = data.DataReader('GOOG', start='2004', end='2016',
                       data_source='google')
goog.head()
Output:

For simplicity, we will use just the closing price:


goog = goog['Close']
We can visualize this using the plot() method, after the normal Matplotlib setup
boilerplate
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
goog.plot();
Output:
20.6.1 Resampling and converting frequencies

One common need for time series data is resampling at a higher or lower
frequency. This can be done using the resample() method or the much simpler
asfreq() method. The primary difference between the two is that resample() is
fundamentally a data aggregation, while asfreq() is fundamentally a data
selection.
Taking the Google closing price, let's compare what the two return when we
down-sample the data. Here we will resample the data at the end of business
year:

goog.plot(alpha=0.5, style='-')

goog.resample('BA').mean().plot(style=':')

goog.asfreq('BA').plot(style='--');

plt.legend(['input', 'resample', 'asfreq'],

loc='upper left');

Output:

At each point, resample reports the average of the previous year, while asfreq reports
the value at the end of the year.
For up-sampling, resample() and asfreq() are largely equivalent. In this case, the
default for both methods is to leave the up-sampled points empty, that is, filled with
NA values. Just as with the pd.fillna() function, asfreq() accepts a method argument
to specify how values are imputed. Here, we will resample the business day data at
a daily frequency (i.e., including weekends):

fig, ax = plt.subplots(2, sharex=True)

data = goog.iloc[:10]

data.asfreq('D').plot(ax=ax[0], marker='o')

data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')

data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')

ax[1].legend(["back-fill", "forward-fill"]);

Output:
The top panel is the default: non-business days are left as NA values and do not
appear on the plot.

The bottom panel shows the differences between two strategies for filling the gaps:
forward-filling and backward-filling.
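The same idea in a tiny self-contained sketch (the values are invented):

s = pd.Series([1.0, 4.0], index=pd.to_datetime(['2015-07-03', '2015-07-06']))
s.asfreq('D')                   # the intermediate days become NaN
s.asfreq('D', method='ffill')   # carry the last observation forward
s.asfreq('D', method='bfill')   # fill backward from the next observation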

20.6.2 Time-shifts

Another common time series-specific operation is shifting of data in time. Pandas has
two closely related methods for computing this: shift() and tshift(). The difference
between them is that shift() shifts the data, while tshift() shifts the index. In both
cases, the shift is specified in multiples of the frequency.

Here we will both shift() and tshift() by 900 days;

fig, ax = plt.subplots(3, sharey=True)

# apply a frequency to the data

goog = goog.asfreq('D', method='pad')

goog.plot(ax=ax[0])

goog.shift(900).plot(ax=ax[1])

goog.tshift(900).plot(ax=ax[2])

# legends and annotations

local_max = pd.to_datetime('2007-11-05')

offset = pd.Timedelta(900, 'D')

ax[0].legend(['input'], loc=2)

ax[0].get_xticklabels()[2].set(weight='heavy', color='red')

ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(900)'], loc=2)

ax[1].get_xticklabels()[2].set(weight='heavy', color='red')

ax[1].axvline(local_max + offset, alpha=0.3, color='red')


ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');
Output:

shift(900) shifts the data by 900 days, pushing some of it off the end of the graph
(and leaving NA values at the other end), while tshift(900) shifts the index values by
900 days.

A common context for this type of shift is in computing differences over time. For
example, we use shifted values to compute the one-year return on investment for
Google stock over the course of the dataset:

ROI = 100 * (goog.tshift(-365) / goog - 1)

ROI.plot()

plt.ylabel('% Return on Investment');

Output:
This helps us to see the overall trend in Google stock.
20.6.3 Rolling windows
Rolling statistics are a third type of time series-specific operation implemented by
Pandas. These can be accomplished via the rolling() attribute of Series and
DataFrame objects, which returns a view similar to what we saw with the groupby
operation. This rolling view makes available a number of aggregation operations by
default.
For example, the one-year centered rolling mean and standard deviation of the
Google stock prices:
rolling = goog.rolling(365, center=True)
data = pd.DataFrame({'input': goog,
'one-year rolling_mean': rolling.mean(),
'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)
Output:
As with group-by operations, the aggregate() and apply() methods can be used for
custom rolling computations.
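For example, a hedged sketch of custom rolling computations on the same closing-price series (the 30-day window is arbitrary):

rolling30 = goog.rolling(30, center=True)

# Custom statistic: peak-to-peak range within each window
price_range = rolling30.apply(lambda x: x.max() - x.min())

# Several named aggregates at once
summary = rolling30.aggregate(['mean', 'std'])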

21. High-Performance Pandas: eval() and query()

The power of the PyData stack is built upon the ability of NumPy and Pandas to push
basic operations into C via an intuitive syntax: examples are vectorized/broadcasted
operations in NumPy, and grouping-type operations in Pandas. While these
abstractions are efficient and effective for many common use cases, they often rely
on the creation of temporary intermediate objects, which can cause undue overhead
in computational time and memory use.

Pandas includes some experimental tools that allow us to directly access C-speed
operations without costly allocation of intermediate arrays. These are the eval() and
query() functions which rely on the Numexpr package.

21.1 Query() and eval(): Compound Expressions

NumPy and Pandas support fast vectorized operations; for example, when adding
the elements of two arrays:
import numpy as np

rng = np.random.RandomState(42)

x = rng.rand(1000000)

y = rng.rand(1000000)

%timeit x + y

Output:

100 loops, best of 3: 3.39 ms per loop

This is much faster than doing the addition via a Python loop or comprehension:

%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

Output:

1 loop, best of 3: 266 ms per loop

But this abstraction can become less efficient when computing compound
expressions. For example, consider the following expression:

mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression, this is equivalent to the following:

tmp1 = (x > 0.5)

tmp2 = (y < 0.5)

mask = tmp1 & tmp2

In other words, every intermediate step is explicitly allocated in memory. If the x
and y arrays are very large, this can lead to significant memory and computational
overhead. The Numexpr library gives you the ability to compute this type of
compound expression element by element, without the need to allocate full
intermediate arrays.

import numexpr

mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')


np.allclose(mask, mask_numexpr)
Output:
True
The advantage here is that Numexpr evaluates the expression in a way that does not
use full-sized temporary arrays, and thus can be much more efficient than NumPy,
especially for large arrays. The Pandas eval() and query() tools are conceptually
similar and depend on the Numexpr package.

21.2 pandas.eval() for Efficient Operations

The eval() function in Pandas uses string expressions to efficiently compute
operations using DataFrames. For example, consider the following DataFrames:

import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
for i in range(4))
To compute the sum of all four DataFrames using the typical Pandas approach, we
can just write the sum:

%timeit df1 + df2 + df3 + df4

Output:

10 loops, best of 3: 87.1 ms per loop

The same result can be computed via pd.eval by constructing the expression as a
string:

%timeit pd.eval('df1 + df2 + df3 + df4')


Output:
10 loops, best of 3: 42.2 ms per loop
The eval() version of this expression is about 50% faster (and uses much less
memory), while giving the same result:
np.allclose(df1 + df2 + df3 + df4,
pd.eval('df1 + df2 + df3 + df4'))
Output:
True
21.3 Operations supported by pd.eval()

pd.eval() supports a wide range of operations. To demonstrate these, we will use
the following integer DataFrames:

df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))
● Arithmetic operators

pd.eval() supports all arithmetic operators. For example:


result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)
Output:
True

● Comparison operators

pd.eval() supports all comparison operators, including chained expressions:


result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
Output:
True

● Bitwise operators

pd.eval() supports the & and | bitwise operators:


result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)
Output:
True
In addition, it supports the use of the literal and and or in Boolean expressions:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)
Output:
True

● Object attributes and indices

pd.eval() supports access to object attributes via the obj.attr syntax and indexes via
the obj[index] syntax:

result1 = df2.T[0] + df3.iloc[1]


result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)
Output:
True

● Other operations

Other operations such as function calls, conditional statements, loops, and other
more involved constructs are currently not implemented in pd.eval(). If we want to
execute these more complicated types of expressions, we can use the Numexpr
library itself.
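For instance, a hedged sketch of a conditional expression written directly for Numexpr (its expression language includes a where() function):

import numexpr
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Element-wise conditional: take a where a > 0.5, otherwise b
result = numexpr.evaluate('where(a > 0.5, a, b)')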

21.4 DataFrame.eval() for Column-Wise Operations

Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method
that works in similar ways. The benefit of the eval() method is that columns can be
referred to by name. We'll use this labeled array as an example:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
Output:

Using pd.eval() as above, we can compute expressions with the three columns like
this:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)
Output:
True
The DataFrame.eval() method allows much more succinct evaluation of expressions
with the columns:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)
Output:
True
here we treat column names as variables within the evaluated expression.

21.4.1 Assignment in DataFrame.eval()


DataFrame.eval() also allows assignment to any column. Let's use the DataFrame
which has columns 'A', 'B', and 'C':
df.head()
Output:

We can use df.eval() to create a new column 'D' and assign to it a value computed
from the other columns:
df.eval('D = (A + B) / C', inplace=True)
df.head()
Output:

In the same way, any existing column can be modified:


df.eval('D = (A - B) / C', inplace=True)
df.head()
21.4.2 Local variables in DataFrame.eval()
The DataFrame.eval() method supports an additional syntax that lets it work with
local Python variables. Consider the following:
column_mean = df.mean(1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)
Output:
True
The @ character here marks a variable name rather than a column name and lets us
efficiently evaluate expressions involving the two "namespaces": the namespace of
columns and the namespace of Python objects. This @ character is only supported
by the DataFrame.eval() method and not by the pandas.eval() function, because the
pandas.eval() function only has access to the one (Python) namespace.

21.5 DataFrame.query() Method


The DataFrame has another method based on evaluated strings, called the query()
method. Consider the following:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)
Output:
True
This is an expression involving columns of the DataFrame. It cannot be expressed
using the DataFrame.eval() syntax. Instead, for this type of filtering operation, we
can use the query() method:
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)
Output:
True
In addition to being a more efficient computation, this is much easier to read and
understand than the masking expression. The query() method also accepts the @
flag to mark local variables:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)
Output:
True

21.6 Performance: When to Use These Functions

When deciding whether to use these functions, there are two considerations:
computation time and memory use. Memory use is the more predictable of the two:
every compound expression involving NumPy arrays or Pandas DataFrames results
in the implicit creation of temporary arrays. For example, this:
x = df[(df.A < 0.5) & (df.B < 0.5)]
Is roughly equivalent to this:
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]

If the size of the temporary DataFrames is significant compared to our available
system memory (typically several gigabytes), then it is a good idea to use an eval()
or query() expression. We can check the approximate size of our array in bytes
using this:
df.values.nbytes
Output:
32000
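As a minimal sketch of comparing the two approaches (timings and the break-even point depend on the machine; the array size below is made up for illustration):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.RandomState(42)
big = pd.DataFrame(rng.rand(100000, 100))

# ordinary compound expression: allocates temporary DataFrames
t_plain = timeit(lambda: (big < 0.5) & (big > 0.1), number=10)

# the same expression through pd.eval(), evaluated by Numexpr
t_eval = timeit(lambda: pd.eval('(big < 0.5) & (big > 0.1)'), number=10)

print(t_plain, t_eval)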
VIDEO LINKS : UNIT – II

1. Aggregations – https://www.youtube.com/watch?v=2I2E1ZbF8pg
2. Computation on Arrays – https://www.youtube.com/watch?v=QD6IBF0Hic4
3. Indexing – https://www.youtube.com/watch?v=WpXH4PzDtYA
4. Sorting arrays – https://www.youtube.com/watch?v=fD4aKa0TeQM


10. ASSIGNMENT : UNIT – II

1. Build a simple Recipe Recommendation System using Pandas. (CO2,K3)

2. Take an open recipe database compiled from various sources on the Web and
use vectorized string operations to parse the recipe data into ingredient lists.
(CO2,K3)

3. Using time series data, Visualize the Seattle Bicycle Counts. (CO2,K3)
11. PART A : Q & A : UNIT – II
1. How can a Series object be modified? (CO2,K3)

Series objects can be modified with a dictionary-like syntax. Just as we can extend a
dictionary by assigning to a new key, we can extend a Series by assigning to a new
index value.

2. What is the Python None object? (CO2,K2)

The first sentinel value used by Pandas is None, a Python singleton object that is
often used for missing data in Python code. Because it is a Python object, None
cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data
type 'object' i.e. arrays of Python objects.
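A minimal illustration (the expected dtype is noted in the comment):

import numpy as np

vals = np.array([1, None, 3, 4])
print(vals.dtype)   # object: the presence of None forces an array of Python objects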

3. What is the use of multi-indexing ? (CO2,K2)

Multi-indexing is used to represent two-dimensional data within a one-dimensional


Series. We can also use it to represent data of three or more dimensions in a Series
or DataFrame. Each extra level in a multi-index represents an extra dimension of
data.
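A small sketch with made-up values:

import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 2020), ('A', 2021),
                                   ('B', 2020), ('B', 2021)])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)
print(s['A'])        # partial indexing on the outer level
print(s.unstack())   # view the same data as a two-dimensional DataFrame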

4. What is pd.merge() function? (CO2,K2)

The pd.merge() function implements a number of types of joins: the one-to-one,


many-to-one and many-to-many joins. All three types of joins are accessed via an
identical call to the pd.merge() interface. The type of join performed depends on
the form of the input data.
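A minimal one-to-one join, using made-up employee data:

import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                    'hire_date': [2004, 2008, 2012]})
print(pd.merge(df1, df2))   # joins on the shared 'employee' column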

5. What is describe() method? (CO2,K2)

The describe() method computes several common aggregates for each column and
returns the result. It is typically applied to a dataset after dropping rows with
missing values.
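A small sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                   'y': [10.0, 20.0, 30.0, 40.0]})
print(df.dropna().describe())   # count, mean, std, min, quartiles and max per column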
6. What is split, apply and combine? (CO2,K3)
● The split step involves breaking up and grouping a data frame depending on
the value of the specified key.
● The apply step involves computing some function usually an aggregate,
transformation, or filtering within the individual groups.
● The combine step merges the results of these operations into an output
array.
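The following minimal sketch illustrates split-apply-combine with groupby (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})
# split on 'key', apply sum() within each group, combine into a single result
print(df.groupby('key').sum())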

7. What is the use of get() and slice() operations? (CO2,K3)


The get() and slice() operations enable vectorized element access from each
array. For example, we can get a slice of the first three characters of each array
using str.slice(0, 3).
get() and slice() methods also let us access elements of arrays returned by split().
For example, to extract the last name of each entry, we can combine split() and
get().
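A short sketch using made-up names:

import pandas as pd

names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])
print(names.str.slice(0, 3))           # first three characters of each entry
print(names.str.split().str.get(-1))   # last name of each entry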

8. What do you mean by datetime and dateutil? (CO2,K2)


The datetime type is used to manually build a date. Using the dateutil module, we
can parse dates from a variety of string formats. With datetime object, we can print
the day of the week.
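A minimal sketch (assuming the dateutil package, which is installed with Pandas, is available):

from datetime import datetime
from dateutil import parser

date = datetime(year=2021, month=7, day=4)    # build a date manually
parsed = parser.parse("4th of July, 2021")    # parse a date from a string
print(parsed.strftime('%A'))                  # day of the week, e.g. 'Sunday'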

9. What is the advantage of using the Numexpr library? (CO2,K2)


The Numexpr library gives the ability to compute compound expressions element by
element without the need to allocate full intermediate arrays.

10. What are eval() and query() tools ? (CO2,K2)

Numexpr evaluates the expression in a way that does not use full-sized temporary
arrays and can be much more efficient than NumPy, especially for large arrays. The
Pandas eval() and query() tools are conceptually similar and depend on the
Numexpr package.
12. PART B QUESTIONS : UNIT – II
1. Explain the creation of multidimensional arrays with examples in Numpy.
( CO2,K3)
2. Write a short notes on the following ( CO2 , K2)

i. Indexing of Ndarrays

ii. Ndarray properties

iii. Numpy Constants

3. Briefly explain NumPy ndarray creation and its role in data visualization. ( CO2 , K2)


4. Explain Pandas series and Dataframe in detail. ( CO2 , K2)
5. Illustrate the steps involved in Visualizing the Data in Dataframes. ( CO2 , K2)
6. Explain Pandas object with examples. (CO2, K3)
7. Write short notes on (CO2, K3)

i. The Pandas DataFrame Object

ii. The Pandas Index Object

8. Explain Data Indexing and Selection in detail.( CO2, K3)


9. Elucidate the steps involved in handling missing data in pandas. ( CO2, K3)
10. Explain the methods of Combining Datasets. ( CO2, K3)
11. Explain in detail about the aggregate, filter, transform and apply operations of
the GroupBy object. (CO2,K3)
12. Write short notes on Pivot tables. ( CO2, K3)
13. Explain the Pandas String Operations in detail. ( CO2, K3)
14. Write short notes on dates and times in pandas with examples. (CO2,K3)
13. SUPPORTIVE ONLINE CERTIFICATION COURSES

NPTEL : https://onlinecourses.nptel.ac.in/noc21_cs69/preview?

coursera : https://www.coursera.org/learn/python-data-analysis

Udemy : https://www.udemy.com/topic/data-science/

Mooc : https://mooc.es/course/introduction-to-data-science-in-python/

edx : https://learning.edx.org/course/course-v1:Microsoft+DAT208x+2T2016/home
14. REAL TIME APPLICATIONS

Major Applications of Pandas

In this list we will cover the most fundamental applications of Pandas:

1. Economics: Economics relies heavily on data analysis. Analyzing data to find
patterns and understand trends, such as how the economy in various sectors is
growing, is essential for economists. Many economists have therefore started using
Python and Pandas to analyze huge datasets. Pandas provides a comprehensive set
of tools, such as DataFrames and file handling, which help immensely in accessing
and manipulating data to get the desired results. Through these applications of
Pandas, economists around the world have been able to make new breakthroughs.

2. Recommendation Systems: We have all used Spotify or Netflix and been amazed
at the brilliant recommendations provided by these services. These systems are a
product of deep learning, and building such recommendation models is one of the
most important applications of Pandas. These models are mostly written in Python,
with Pandas as the main library for handling the data. Pandas is well suited to
managing huge amounts of data, and a recommendation system is possible only by
learning from and handling huge masses of data. Functions like groupby and
mapping help tremendously in making these systems possible.

3. Stock Prediction: The stock market is extremely volatile; however, that does not
mean it cannot be predicted. With the help of Pandas and a few other libraries such
as NumPy and Matplotlib, we can build models that predict how stocks will move.
This is possible because there is a large amount of historical stock data that tells us
how prices have behaved, and by learning from this data a model can predict the
next move with some accuracy. People can also automate the buying and selling of
stocks with the help of such prediction models.

4. Neuroscience: Understanding the nervous system has always fascinated
humankind, because there are many mysteries about our bodies that we have not
yet solved. Machine learning has helped this field immensely through the various
applications of Pandas. The data manipulation capabilities of Pandas have played a
major role in compiling the huge amounts of data that help neuroscientists
understand the trends inside our bodies and the effect of various factors on the
nervous system.

5. Statistics: Statistics itself has made much progress through the various
applications of Pandas. Since statistics deals with a lot of data, a library like Pandas,
which is built for data handling, has helped in many different ways. The mean,
median, and mode functions are only the most basic ones used in statistical
calculations; many other, more complex statistical functions rely on Pandas to
produce accurate results.

6. Advertising: Advertising has taken a huge leap in the 21st century. Nowadays
advertising is highly personalized, which helps companies gain more and more
customers. This has been possible only because of machine learning and deep
learning: models that go through customer data learn to understand what exactly
the customer wants, providing companies with great advertisement ideas. Pandas
has many applications here, since the customer data is often processed with the
help of this library and many of its functions.
15. CONTENTS BEYOND SYLLABUS : UNIT – II

Random Numbers in NumPy

Pseudo Random and True Random.

Computers work on programs, and programs are a definitive set of instructions.
This means there must be an algorithm to generate a random number as well. If
there is a program that generates random numbers, those numbers can be
predicted, so they are not truly random. Random numbers generated through a
generation algorithm are called pseudo-random.

In order to generate a truly random number on our computers, we need to get
random data from some outside source. This outside source is generally our
keystrokes, mouse movements, data on the network, etc. We do not need truly
random numbers unless the application is related to security (e.g. encryption keys)
or randomness is the basis of the application (e.g. digital roulette wheels).
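A minimal sketch of generating pseudo-random numbers with NumPy (assuming NumPy 1.17 or later for default_rng; seeding makes the sequence reproducible, which underlines that it is pseudo-random):

import numpy as np

rng = np.random.default_rng(seed=42)     # seeded pseudo-random generator
print(rng.integers(0, 100, size=5))      # five pseudo-random integers in [0, 100)
print(rng.random(3))                     # three pseudo-random floats in [0, 1)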

Visualize Distributions With Seaborn

Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used
to visualize random distributions.

Distplots

Distplot stands for distribution plot; it takes an array as input and plots a curve
corresponding to the distribution of points in the array. (Note that in recent versions
of Seaborn, distplot() is deprecated in favour of displot() and histplot().)

Plotting a Distplot

import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot([0, 1, 2, 3, 4, 5])
plt.show()
16. Assessment Schedule (Proposed Date & Actual Date)

Sl. No. | ASSESSMENT | Proposed Date | Actual Date
1 | FIRST INTERNAL ASSESSMENT | 09.09.2023 |
2 | SECOND INTERNAL ASSESSMENT | 26.10.2023 |
3 | MODEL EXAMINATION | 15.11.2023 |
4 | END SEMESTER EXAMINATION | 05.12.2023 |


17. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS

TEXTBOOKS:

1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data
Science”, Manning Publications, 2016.
2. Ashwin Pajankar, Aditya Joshi, “Hands-on Machine Learning with Python:
Implement Neural Network Solutions with Scikit-learn and PyTorch”, Apress, 2022.
3. Jake VanderPlas, “Python Data Science Handbook – Essential tools for working
with data”, O’Reilly, 2017.

REFERENCES:

1. Roger D. Peng, “R Programming for Data Science”, Lulu.com, 2016.
2. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”,
3rd Edition, Morgan Kaufmann, 2012.
3. Samir Madhavan, “Mastering Python for Data Science”, Packt Publishing, 2015.
4. Laura Igual, Santi Seguí, “Introduction to Data Science: A Python Approach to
Concepts, Techniques and Applications”, 1st Edition, Springer, 2017.
5. Peter Bruce, Andrew Bruce, “Practical Statistics for Data Scientists: 50 Essential
Concepts”, 3rd Edition, O'Reilly, 2017.
6. Hector Guerrero, “Excel Data Analysis: Modelling and Simulation”, Springer
International Publishing, 2nd Edition, 2019.

E-Book links:

1. https://drive.google.com/file/d/1HoGVyZqLTQj0aA4THA__D4jJ74czxEKH/view
?usp=sharing
2. https://drive.google.com/file/d/1vJfX5xipCHZOleWfM9aUeK8mwsal6Il1/view?u
sp=sharing
3. https://drive.google.com/file/d/1aU2UKdLxLdGpmI73S1bifK8JPiMXlpoS/view?
usp=sharing
18. MINI PROJECT SUGGESTION

Mini Projects

a) Recommendation system
b) Credit Card Fraud Detection
c) Fake News Detection
d) Customer Segmentation
e) Sentiment Analysis
f) Recommender Systems
g) Emotion Recognition
h) Stock Market Prediction
i) Email classification
j) Tweets classification
k) Uber Data Analysis
l) Social Network Analysis
Thank you

