This document is confidential and intended solely for the educational purpose of RMK Group of Educational Institutions. If you have received this document through email in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group / learning community as intended. If you are not the addressee, you should not disseminate, distribute or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete this document from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
22AI302
DATA SCIENCE USING
PYTHON
Department: AI & DS
Batch/Year: 2022-2026 /II YEAR
Created by:
Ms. Divya D M / Asst. Professor
Date: 27.07.2023
1. Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
7. Lecture Plan
9. Lecture Notes
Pre-Requisites:
Semester-I: C Programming
Semester-II: Python Programming
Semester-III: 22AI302 Data Science using Python (this course)
4.SYLLABUS
22AI302 DATA SCIENCE USING PYTHON    L T P C
                                     2 0 2 3
Data Science: Benefits and uses – facets of data - Data Science Process: Overview –
Defining research goals – Retrieving data – data preparation - Exploratory Data
analysis – build the model – presenting findings and building applications - Data
Mining - Data Warehousing – Basic statistical descriptions of Data.
List of Exercise/Experiments:
1. Download, install and explore the features of R/Python for data analytics
• Installing Anaconda
• Basic Operations in Jupyter Notebook
• Basic Data Handling
List of Exercise/Experiments:
1. Working with NumPy arrays - Creation of a NumPy array from a tuple; determining the size, shape and dimension of the array; manipulating array attributes; creating a sub-array; reshaping the array along the row vector and column vector; creating two arrays and performing concatenation between them.
2. Working with Pandas data frames - Series, DataFrame, and Index; implementing the data selection operations; data indexing operations like loc, iloc, and ix; handling missing data such as None and NaN; manipulating null values with isnull(), notnull(), dropna(), and fillna().
3. Perform statistics operations on data (sum, product, median, minimum and maximum, quantiles, argmin, argmax, etc.).
4. Using any data set, compute the mean, standard deviation, and percentiles.
List of Exercise/Experiments:
1. Apply Decision Tree algorithms on any data set.
2. Apply SVM on any data set
3. Implement K-Nearest-Neighbor Classifiers
List of Exercise/Experiments:
1. Apply K-means algorithms for any data set.
2. Perform Outlier Analysis on any data set.
List of Exercise/Experiments:
1. Basic plots using Matplotlib.
2. Implementation of Scatter Plot.
3. Construction of Histogram, bar plot, Subplots, Line Plots.
TEXTBOOKS:
REFERENCES:
1. Roger D. Peng, R Programming for Data Science, Lulu.com, 2016.
2. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann, 2012.
3. Samir Madhavan, Mastering Python for Data Science, Packt Publishing, 2015.
4. Laura Igual, Santi Seguí, "Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications", 1st Edition, Springer, 2017.
5. Peter Bruce, Andrew Bruce, "Practical Statistics for Data Scientists: 50 Essential Concepts", 3rd Edition, O'Reilly, 2017.
6. Hector Guerrero, "Excel Data Analysis: Modelling and Simulation", Springer International Publishing, 2nd Edition, 2019.
NPTEL Courses:
a. Data Science for Engineers - https://onlinecourses.nptel.ac.in/noc23_cs17/preview
b. Python for Data Science - https://onlinecourses.nptel.ac.in/noc23_cs21/preview
LIST OF EQUIPMENTS:
Systems with Anaconda, Jupyter Notebook, Python, Pandas, NumPy, Matplotlib
5. COURSE OUTCOMES
CO-PO mapping matrix:
CO2: 3 3 3 3 1 1 1 1 2 3 3
CO3: 3 3 3 3 3 3 3 3 2 3 3
CO4: 3 3 3 3 3 3 3 3 2 3 3
CO5: 3 3 3 3 3 3 3 3 2 3 3
Lecture Plan
Unit - II
LECTURE PLAN - Unit 2 - PYTHON LIBRARIES FOR DATA SCIENCE

Sl.No | Topic | No. of Periods | Proposed Date | Actual Lecture Date | CO | Taxonomy Level | Mode of Delivery
1  | Introduction to Numpy - Multidimensional Ndarrays        | 1 | 26.08.2023 | | CO2 | K2 | PPT / Chalk & Talk
2  | Indexing - Properties - Constants                         | 1 | 28.08.2023 | | CO2 | K2 | PPT / Chalk & Talk
3  | Data Visualization: Ndarray Creation - Matplotlib         | 1 | 29.08.2023 | | CO2 | K2 | PPT / Chalk & Talk
6  | Pandas Objects - Data Indexing and Selection              | 1 | 01.09.2023 | | CO2 | K2 | PPT / Chalk & Talk
7  | Handling missing data - Hierarchical indexing             | 1 | 02.09.2023 | | CO2 | K2 | PPT / Chalk & Talk
8  | Combining datasets - Aggregation and Grouping             | 1 | 04.09.2023 | | CO2 | K2 | PPT / Chalk & Talk
9  | Joins - Pivot Tables - String operations                  | 1 | 05.09.2023 | | CO2 | K2 | PPT / Chalk & Talk
10 | Working with time series - High performance Pandas        | 1 | 07.09.2023 | | CO2 | K2 | PPT / Chalk & Talk
8. ACTIVITY BASED LEARNING
Activity name:
Students will gain a better understanding of how Python libraries and other Python features work with any dataset.
Guidelines to do an activity:
4) Conduct peer review. (Each team will be reviewed by all other teams and mentors.)
Useful links:
https://towardsdatascience.com/creating-and-automating-an-interactive-dashboard-using-python-5d9dfa170206
https://github.com/tsbloxsom/Texas-census-county-data-project
UNIT-II
PYTHON LIBRARIES FOR DATA SCIENCE
9.LECTURE NOTES
1. Introduction to Numpy
• NumPy is the fundamental library for numerical computation and an integral part of the Scientific Python Ecosystem.
• NumPy is important because it is used to store data. It provides a basic yet very versatile data structure known as the Ndarray (N-dimensional array). Python has many array-like data structures (e.g., the list), but the Ndarray is the most versatile and the most preferred structure for storing scientific and numerical data.
• Many libraries have their own data structures, and most of them use Ndarrays as their base; Ndarrays are compatible with many data structures and routines, just like lists. Let us create a simple Ndarray as follows:
import numpy as np
lst1 = [1, 2, 3]
arr1 = np.array(lst1)
• Here, we import NumPy with the alias np. Then we create a list and pass it as an argument to the function array().
• Let’s see the data types of all the variables used:
print(type(lst1))
print(type(arr1))
<class 'list'>
<class 'numpy.ndarray'>
Let’s see the contents of the Ndarray as follows:
arr1
The output is as follows:
array([1, 2, 3])
We can combine these steps and create the Ndarray in a single line, and we can also specify the data type of the members of the Ndarray; both are sketched below.
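A minimal sketch of both operations (the names arr2 and arr3 are illustrative, not from the original notes):
import numpy as np
# create the Ndarray in a single line, without a separate list variable
arr2 = np.array([1, 2, 3])
# specify the data type of the members explicitly
arr3 = np.array([1, 2, 3], dtype=np.float32)
print(arr2.dtype)   # int64 (platform dependent)
print(arr3.dtype)   # float32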
2. Multidimensional Ndarrays
Ndarrays are not limited to a single dimension: passing nested lists to np.array() creates two- and three-dimensional arrays (a creation sketch follows the output below). A two-dimensional array prints as output beginning array([[1, 2, 3], ..., while a three-dimensional int16 array stored in arr1 prints as follows:
arr1
array([[[ 1, 2, 3],
        [ 4, 5, 6]],

       [[ 7, 8, 9],
        [ 0, 0, 0]],
       ...
        [ 1, 1, 1]]], dtype=int16)
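A sketch of how such arrays are created; the exact element values of the original example were partly lost, so the literals below are illustrative:
import numpy as np
# two-dimensional Ndarray from a nested list
arr2d = np.array([[1, 2, 3], [4, 5, 6]], np.int16)
# three-dimensional Ndarray: a list of 2-D blocks
arr3d = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [0, 0, 0]],
                  [[1, 1, 1], [1, 1, 1]]], np.int16)
print(arr2d.shape)   # (2, 3)
print(arr3d.shape)   # (3, 2, 3)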
3. Indexing of Ndarrays
• We can address the elements (also called the members) of the Ndarrays individually. Let's see how to do it with one-dimensional Ndarrays:
print(arr1[0])
print(arr1[1])
print(arr1[2])
• Just like lists, Ndarrays follow C-style indexing, where the first element is at position 0 and the nth element is at position (n-1).
• We can also access the last elements with negative position numbers as follows:
print(arr1[-1])
print(arr1[-2])
• Accessing a position that does not exist raises an error. For example, a three-element array has no element at position 3:
print(arr1[3])
--------------------------------------------------------------------------
IndexError: index 3 is out of bounds for axis 0 with size 3
• For multidimensional Ndarrays, we pass one position per dimension. For example, for a two-dimensional array:
print(arr1[0, 0])
print(arr1[0, 1])
print(arr1[0, 2])
4. Properties of Ndarrays
Let us learn all the properties with a demonstration, using the same 3D matrix we used earlier:
x2 = np.array([[[1, 2, 3], [4, 5, 6]], [[0, -1, -2], [-3, -4, -5]]], np.int16)
print(x2.ndim)
3
print(x2.shape)
(2, 2, 3)
print(x2.dtype)
int16
We can know the size (number of elements) and the number of bytes required in the
memory for the storage as follows:
print(x2.size)
print(x2.nbytes)
12
24
The T attribute returns the transpose of the Ndarray:
print(x2.T)
5. NumPy Constants
NumPy library has many useful mathematical and scientific constants we can use in
programs. The following code snippet prints all such important constants:
print(np.inf)
print(np.NAN)
print(np.NINF)
print(np.NZERO)
print(np.PZERO)
print(np.e)
print(np.euler_gamma)
print(np.pi)
inf
nan
-inf
-0.0
0.0
2.718281828459045
0.5772156649015329
3.141592653589793
6. Data Visualization: Numpy routines for Ndarray Creation
The routine np.empty() creates an uninitialized array of the given size. Because the array is not initialized, its elements hold arbitrary values, so the printed output will very likely differ on your machine. Multidimensional empty matrices are created the same way by passing a shape; a sketch of both calls follows.
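A minimal sketch of np.empty(), assuming a one-dimensional array of 3 elements and a 2x2 uint8 matrix (the original shapes were not preserved):
import numpy as np
# 1-D uninitialized array of 3 float64 values
x = np.empty(3)
print(x)
# 2x2 uninitialized matrix of unsigned 8-bit integers
x = np.empty((2, 2), dtype=np.uint8)
print(x)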
We can use the routine np.eye() to create a matrix with ones on the diagonal and zeros everywhere else.
y = np.eye(4, dtype=np.uint8)
print(y)
[[1 0 0 0]
[0 1 0 0]
[0 0 1 0]
[0 0 0 1]]
We can also shift the position of the diagonal of ones. For example, shifting it one place above the main diagonal:
print(y)
The output is as follows:
[[0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 0 0 0]]
We can even use a negative value for the position of the diagonal of ones, placing it below the main diagonal; both calls are sketched below:
print(y)
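A sketch of the np.eye() calls that produce such shifted diagonals; the offsets k=1 and k=-1 are assumed values:
import numpy as np
# ones on the first diagonal above the main diagonal
y = np.eye(4, k=1, dtype=np.uint8)
print(y)
# ones on the first diagonal below the main diagonal
y = np.eye(4, k=-1, dtype=np.uint8)
print(y)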
The function np.identity() returns an identity matrix of the specified size: all diagonal elements are 1 and the rest of the elements are 0. The routine np.ones() returns a matrix of the given size (and optional data type) with all elements equal to one. Run the sketch below to see both in action.
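A minimal sketch of np.identity() and np.ones(); the sizes and the uint8 dtype are illustrative assumptions:
import numpy as np
# 2x2 identity matrix
x = np.identity(2, dtype=np.uint8)
print(x)
# array of ones with shape (2, 5, 2)
x = np.ones((2, 5, 2), dtype=np.uint8)
print(x)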
Let us have a look at the routine arange(). It creates an Ndarray of evenly spaced values within the given interval. An argument for the stop value is compulsory; the start value and step parameters have default arguments 0 and 1, respectively.
Let us see an example:
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The routine np.linspace() returns an Ndarray of evenly spaced numbers over a specified interval; we pass it the starting value, the end value, and the number of values (see the sketch below). The related routine np.logspace() returns numbers evenly spaced on a logarithmic scale:
np.logspace(0.1, 2, 10)
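A sketch of the linspace() call described above; the arguments 0, 1, 5 are illustrative:
import numpy as np
# 5 evenly spaced values from 0 to 1, both endpoints included
print(np.linspace(0, 1, 5))   # [0.   0.25 0.5  0.75 1.  ]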
The following magic command enables Jupyter Notebook to show Matplotlib visualizations inline; we also import the plotting interface:
%matplotlib inline
import matplotlib.pyplot as plt
x = np.arange(10)
y = x + 1
plt.plot(x, y)
plt.show()
x = np.arange(10)
y1 = 1 - x
plt.plot(x, y, x, y1)
plt.show()
As we can see, the routine plt.plot() can visualize data as simple lines. We can also
plot data of other forms with it. The limitation is that it must be single dimensional.
Let's draw a sine wave as follows (a complete sketch, including the definition of the sample points t, appears below):
n=3
y = np.sin( n * t )
plt.plot(t, y)
plt.show()
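A self-contained sketch of the sine-wave plot; the definition of the sample points t was not preserved, so 100 points between 0 and 2π are assumed:
import numpy as np
import matplotlib.pyplot as plt
t = np.linspace(0, 2 * np.pi, 100)   # assumed sample points
n = 3
y = np.sin(n * t)                    # three full sine cycles over the interval
plt.plot(t, y)
plt.show()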
The output is shown in Figure 3-3.
We can also have other types of plots. Let’s visualize a bar plot.
n=5
x = np.arange(n)
y = np.random.rand(n)
plt.bar(x, y)
plt.show()
fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_title('Bar Graph')
ax.set_xlabel('X')
ax.set_ylabel('Y')
plt.show()
As we can see, the code creates a figure and an axis that we can use to call
visualization routines and to set the properties of the visualizations.
Let’s see how to create subplots. Subplots are the plots within the visualization. We
can create them as follows:
x = np.arange(10)
plt.subplot(2, 2, 1)
plt.plot(x, x)
plt.title('Linear')
plt.subplot(2, 2, 2)
plt.plot(x, x*x)
plt.title('Quadratic')
plt.subplot(2, 2, 3)
plt.plot(x, np.sqrt(x))
plt.title('Square root')
plt.subplot(2, 2, 4)
plt.plot(x, np.log(x))
plt.title('Log')
plt.tight_layout()
plt.show()
As we can see, we are creating a subplot before each plotting routine call. The
routine tight_layout() creates enough spacing between subplots. The output is as
shown in Figure 3-5.
fig, ax = plt.subplots(2, 2)
ax[0][0].plot(x, x)
ax[0][0].set_title('Linear')
ax[0][1].plot(x, x*x)
ax[0][1].set_title('Quadratic')
ax[1][0].plot(x, np.sqrt(x))
ax[1][0].set_title('Square Root')
ax[1][1].plot(x, np.log(x))
ax[1][1].set_title('Log')
plt.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.4,
hspace=0.4)
plt.show()
Let’s move ahead with the scatter plot. We can visualize 2D data as scatter plot as
follows:
n = 100
x = np.random.rand(n)
y = np.random.rand(n)
plt.scatter(x, y)
plt.show()
plt.hist(x)
plt.show()
Here, mu means the mean and sigma means the standard deviation of the normally distributed data being plotted (the construction of x is sketched below). The output is as shown in Figure 3-7.
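A sketch of how the histogram data could be built; the values mu = 100, sigma = 15 and the sample size are assumptions:
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 100, 15                       # assumed mean and standard deviation
x = mu + sigma * np.random.randn(10000)   # normally distributed samples
plt.hist(x, bins=50)
plt.show()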
plt.pie(x)
plt.show()
• Pandas is the data analytics and data science library of the Scientific Python
Ecosystem. Just like NumPy, Matplotlib, IPython, and Jupyter Notebook, it is
an integral part of the ecosystem.
• It is used for storage, manipulation, and visualization of multidimensional data. It is more flexible than Ndarrays and is also compatible with them, which means that we can use Ndarrays to create Pandas data structures.
• Let's create a new notebook for the demonstrations in this chapter. We can install Pandas with the following command in the Jupyter Notebook session (the command itself is sketched below):
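The install command itself was not preserved; in a notebook cell it is typically:
!pip3 install pandas
(or conda install pandas in an Anaconda environment, where Pandas is usually preinstalled).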
The following code imports the library to the current program or Jupyter Notebook
session:
import pandas as pd
%matplotlib inline
import pandas as pd
import numpy as np
s1 = pd.Series([1, 2, 3 , 4, 5])
If we type the following code:
type(s1)
pandas.core.series.Series
s2 = pd.Series(np.arange(5), dtype=np.uint8)
s2
0 0
1 1
2 2
3 3
4 4
dtype: uint8
• The first column is the index, and the second column is the data column. We can also create a series from an already defined Ndarray as follows:
s3 = pd.Series(arr1, dtype=np.int16)
s3
In this case, the dtype passed to pd.Series() determines the final data type of the series.
7.2 Properties of Series
We can check the values of the series with the following code:
s3.values
The array attribute returns the same values wrapped in a PandasArray:
s3.array
<PandasArray>
[0, 1, 2, 3, 4]
s3.index
s3.dtype
s3.shape
s3.size
We can check the number of bytes as follows:
s3.nbytes
s3.ndim
We can create a DataFrame from a dictionary of columns (the dictionary data used here was not preserved; a sketch follows this paragraph):
df1 = pd.DataFrame(data)
print(df1)
df1.head()
Run this and see the output. We can also create the dataframe with a specific order of columns, as shown in the sketch below.
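A minimal sketch, assuming a small two-column dictionary since the original contents of data were not shown:
import pandas as pd
data = {'name': ['a', 'b', 'c'], 'marks': [90, 85, 88]}   # assumed example data
df1 = pd.DataFrame(data)
print(df1)
# the same data with an explicit column order
df1 = pd.DataFrame(data, columns=['marks', 'name'])
print(df1)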
• We have learned the data visualization of NumPy data with the data
visualization library Matplotlib.
• Now, we will learn how to visualize Pandas data structures.
• Objects of Pandas data structures call Matplotlib visualization functions like
plot(). Basically, Pandas provides a wrapper for all these functions. Let us see
a simple example as follows:
df1 = pd.DataFrame()
df1['A'] = pd.Series(list(range(100)))
df1['B'] = np.random.randn(100, 1)
df1
So this code creates a dataframe. Let’s plot it now:
df1.plot(x='A', y='B')
plt.show()
• Now let's explore the other plotting methods. We will create a dataset of four columns.
• The columns will hold random data generated with NumPy, so your output will certainly differ.
• We will use the generated dataset for the rest of the examples. So let's generate the dataset (a sketch follows):
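A sketch of generating such a four-column random dataset; the column names A-D and the 10 rows are assumptions:
import numpy as np
import pandas as pd
# 10 rows of normally distributed random values in four columns
df2 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])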
print(df2)
It generates data like the following:
df2.plot.bar()
plt.show()
df2.plot.barh()
plt.show()
df2.plot.bar(stacked = True)
plt.show()
df2.plot.barh(stacked = True)
plt.show()
df2.plot.hist(alpha=0.7)
plt.show()
df2.plot.hist(stacked=True, alpha=0.7)
plt.show()
df2.plot.box()
plt.show()
df2.plot.area()
plt.show()
df2.plot.area(stacked=False)
plt.show()
9. Pandas Objects
9.1 The Pandas Series object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
In the output, the Series wraps both a sequence of values and a sequence of
indices, which we can access with the values and index attributes. The values are
simply a familiar NumPy array:
data.values
Output:
array([ 0.25, 0.5 , 0.75, 1. ])
The index is an array-like object of type pd.Index
data.index
Output:
RangeIndex(start=0, stop=4, step=1)
Like with a NumPy array, data can be accessed by the associated index via the
familiar Python square-bracket notation:
data[1]
Output:
0.5
data[1:3]
Output:
1 0.50
2 0.75
dtype: float64
Pandas Series is much more general and flexible than the one-dimensional NumPy
array.
9.2 Series as generalized NumPy array
While the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer but can consist of values of any desired type. If we wish, we can use strings as an index:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data['b']
Output:
0.5
We can even use non-contiguous or non-sequential indices (e.g., index=[2, 5, 3, 7]).
9.3 Series as specialized dictionary
A Series can also be thought of as a specialization of a Python dictionary: because its keys and values are typed, it is much more efficient than a Python dictionary for certain operations. Consider a population Series built from a dictionary of state populations (a sketch of its construction follows); it supports both dictionary-style access and slicing:
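A sketch of the population Series; the California, Florida, and Illinois values match the output below, while the remaining entries are assumed from the usual example:
import pandas as pd
population_dict = {'California': 38332521,
                   'Florida': 19552860,
                   'Illinois': 12882135,
                   'New York': 19651127,   # assumed value
                   'Texas': 26448193}      # assumed value
population = pd.Series(population_dict)
population['California']   # dictionary-style access -> 38332521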
population['California':'Illinois']
Output:
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
9.4 Constructing Series objects
pd.Series(data, index=index)
where index is an optional argument, and data can be one of many entities.
For example, data can be a list or NumPy array, in which case index defaults to an
integer sequence:
pd.Series([2, 4, 6])
Output:
0 2
1 4
2 6
dtype: int64
data can also be a scalar, which is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])
Output:
100 5
200 5
300 5
dtype: int64
data can be a dictionary, in which case index defaults to the (sorted) dictionary keys:
pd.Series({2: 'a', 1: 'b', 3: 'c'})
Output:
1 b
2 a
3 c
dtype: object
In each case, the index can be explicitly set if a different result is preferred:
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
Output:
3 c
2 a
dtype: object
In this case, the Series is populated only with the explicitly identified keys.
10. The Pandas DataFrame object
A DataFrame can be constructed from two Series sharing an index, here the population Series from above and an analogous area Series:
area = pd.Series({'California': 423967, 'Florida': 170312, 'Illinois': 149995,
                  'New York': 141297, 'Texas': 695662})
states = pd.DataFrame({'population': population,
                       'area': area})
states
Output:
Like the Series object, the DataFrame has an index attribute that gives access to the
index labels:
states.index
Output:
states.columns
Output:
states['area']
Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
10.3 Constructing DataFrame objects
pd.DataFrame(population, columns=['population'])
Output:
● From a list of dicts
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e.
"not a number") values:
Output:
A Pandas DataFrame operates much like a structured array and can be created
directly from one:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
Output:
array([(0, 0.0), (0, 0.0), (0, 0.0)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
Output:
The Pandas Index object behaves like an immutable array. Consider, for example, ind = pd.Index([2, 3, 5, 7, 11]). The Index in many ways operates like an array; we can use standard Python indexing notation to retrieve values or slices:
ind[1]
Output:
ind[::2]
Output:
Index objects also have many of the attributes familiar from NumPy arrays:
print(ind.size, ind.shape, ind.ndim, ind.dtype)
Output:
5 (5,) 1 int64
One difference between Index objects and NumPy arrays is that indices are
immutable–that is, they cannot be modified via the normal means:
ind[1] = 0
Output:
TypeError: Index does not support mutable operations
This immutability makes it safer to share indices between multiple DataFrames and
arrays, without the potential for side effects from inadvertent index modification.
Pandas objects are designed to facilitate operations such as joins across datasets,
which depend on many aspects of set arithmetic. The Index object follows many of
the conventions used by Python's built-in set data structure, so that unions,
intersections, differences, and other combinations can be computed in a familiar
way:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Output:
Int64Index([3, 5, 7], dtype='int64')
indA | indB # union
Output:
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB # symmetric difference
Output:
Int64Index([1, 2, 9, 11], dtype='int64')
These operations may also be accessed via object methods, for example
indA.intersection(indB).
12. Data Indexing and Selection
12.1 Data Selection in Series
12.1.1 Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values. Using the Series data with index ['a', 'b', 'c', 'd'] defined earlier:
data['b']
Output:
0.5
We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values:
'a' in data
Output:
True
data.keys()
Output:
list(data.items())
Output:
Series objects can even be modified with a dictionary-like syntax. Just as we can extend
a dictionary by assigning to a new key, we can extend a Series by assigning to a new
index value:
data['e'] = 1.25
data
Output:
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
12.1.2 Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays: that is, slices, masking, and fancy indexing. Examples of these are as follows:
# slicing by explicit index
data['a':'b']
Output:
a 0.25
b 0.50
dtype: float64
# masking
data[(data > 0.3) & (data < 0.8)]
Output:
b 0.50
c 0.75
dtype: float64
# fancy indexing
data[['a', 'e']]
Output:
a 0.25
e 1.25
dtype: float64
When slicing with an explicit index (i.e., data['a':'c']), the final index is included in the
slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is
excluded from the slice.
If a Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index. Consider a Series with an explicit integer index:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Output:
1 a
3 b
5 c
dtype: object
# explicit index when indexing
data[1]
Output:
'a'
# implicit index when slicing
data[1:3]
Output:
3 b
5 c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides
some special indexer attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing
interface to the data in the Series.
First, the loc attribute allows indexing and slicing that always references the explicit
index:
data.loc[1]
Output:
'a'
data.loc[1:3]
Output:
1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:
data.iloc[1]
Output:
'b'
data.iloc[1:3]
Output:
3 b
5 c
dtype: object
12.2 Data Selection in DataFrame
A DataFrame acts in many ways like a dictionary of Series that share the same index. For the examples below, data is a DataFrame of state areas and populations built from the Series defined earlier, data = pd.DataFrame({'area': area, 'pop': population}):
data
Output:
The individual Series that make up the columns of the DataFrame can be accessed
via dictionary-style indexing of the column name:
data['area']
Output:
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
We can use attribute-style access with column names that are strings:
data.area
Output:
California 423967
Florida 170312
Illinois 149995
Texas 695662
This attribute-style column access actually accesses the exact same object as the
dictionary-style access:
data.area is data['area']
Output:
True
For example, if the column names are not strings, or if the column names conflict
with methods of the DataFrame, this attribute-style access is not possible. For
example, the DataFrame has a pop() method, so data.pop will point to this rather
than the "pop" column:
data.pop is data['pop']
Output:
False
Like with the Series objects, this dictionary-style syntax can also be used to modify the object, in this case adding a new column computed from the existing ones:
data['density'] = data['pop'] / data['area']
data
Output:
12.2.2 DataFrame as two-dimensional array
data.values
Output:
data.values[0]
Output:
data['area']
Output:
California 423967
Florida 170312
Illinois 149995
Texas 695662
Using the iloc indexer, we can index the underlying array as if it is a simple NumPy
array (using the implicit Python-style index), but the DataFrame index and column
labels are maintained in the result:
data.iloc[:3, :2]
Output:
Similarly, using the loc indexer we can index the underlying data in an array-like
style but using the explicit index and column names:
data.loc[:'Illinois', :'pop']
Output:
The ix indexer (a hybrid of loc and iloc, removed in recent versions of Pandas) allowed the two approaches to be mixed:
data.ix[:3, :'pop']
Output:
In the loc indexer we can combine masking and fancy indexing, as in the following:
data.loc[data.density > 100, ['pop', 'density']]
Output:
data.iloc[0, 2] = 90
data
Output:
12.2.3 Additional indexing conventions
data['Florida':'Illinois']
Output:
Such slices can also refer to rows by number rather than by index:
data[1:3]
Output:
Similarly, direct masking operations are interpreted row-wise rather than column-wise:
data[data.density > 100]
Output:
13. Handling Missing Data
There are two strategies for indicating the presence of missing data in a table or DataFrame: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriating one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention,
such as indicating a missing integer value with -9999 or some rare bit pattern, or it
could be a more global convention, such as indicating a missing floating-point value
with NaN (Not a Number), a special value which is part of the IEEE floating-point
specification.
As in most cases where no universally optimal choice exists, different languages and
systems use different conventions.
Pandas chose to use sentinels for missing data and further chose to use two
already-existing Python null values: the special floating point NaN value and the
Python None object.
The first sentinel value used by Pandas is None, a Python singleton object that is
often used for missing data in Python code. Because it is a Python object, None
cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data
type 'object' (i.e., arrays of Python objects):
import numpy as np
import pandas as pd
vals1 = np.array([1, None, 3, 4])
vals1
Output:
array([1, None, 3, 4], dtype=object)
Because the array has the object dtype, operations on it are performed at the Python level and are much slower than operations on arrays of native types such as int.
The use of Python objects in an array also means that if we perform aggregations like sum() or min() across an array with a None value, we will generally get an error:
vals1.sum()
Output:
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Output:
dtype('float64')
The result of arithmetic with NaN will be another NaN:
1 + np.nan
Output:
nan
0 * np.nan
Output:
nan
vals2.sum(), vals2.min(), vals2.max()
Output:
(nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Output:
(8.0, 1.0, 4.0)
NaN is specifically a floating-point value. There is no equivalent NaN value for
integers, strings, or other types.
13.2.3 NaN and None in Pandas
pd.Series([1, np.nan, 2, None])
Output:
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
For types that don't have an available sentinel value, Pandas automatically
type-casts when NA values are present. For example, if we set a value in an integer
array to np.nan, it will automatically be upcast to a floating-point type to
accommodate the NA:
x = pd.Series(range(2), dtype=int)
Output:
0 0
1 1
dtype: int64
x[0] = None
Output:
0 NaN
1 1.0
dtype: float64
The upcasting conventions in Pandas when NA values are introduced are, by type class: floating-point and object arrays keep their dtype (NA stored as NaN or None); integer arrays are cast to float64 (NA stored as NaN); Boolean arrays are cast to object (NA stored as None or NaN).
In Pandas, string data is always stored with an object dtype.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures: isnull(), notnull(), dropna(), and fillna().
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one returns a Boolean mask over the data. For example, with data = pd.Series([1, np.nan, 'hello', None]):
data.isnull()
Output:
0 False
1 True
2 False
3 True
dtype: bool
data[data.notnull()]
Output:
0 1
2 hello
dtype: object
The isnull() and notnull() methods produce similar Boolean results for DataFrames.
13.3.2 Dropping null values
In addition to the masking used before, there are the convenience methods,
dropna() (which removes NA values) and fillna() (which fills in NA values). For a
Series, the result is straightforward:
data.dropna()
Output:
0 1
2 hello
dtype: object
Consider the following DataFrame:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
Output:
We cannot drop single values from a DataFrame; we can only drop full rows or full
columns.
By default, dropna() will drop all rows in which any null value is present:
df.dropna()
Output:
Alternatively, we can drop NA values along a different axis; axis=1 drops all columns
containing a null value:
df.dropna(axis='columns')
Output:
The default is how='any', such that any row or column (depending on the axis
keyword) containing a null value will be dropped. We can also specify how='all',
which will only drop rows/columns that are all null values:
df[3] = np.nan
df
Output:
df.dropna(axis='columns', how='all')
Output:
The thresh parameter lets us specify a minimum number of non-null values for
the row/column to be kept:
df.dropna(axis='rows', thresh=3)
Output:
Here the first and last row have been dropped, because they contain only two
non-null values.
13.3.3 Filling null values
Pandas provides the fillna() method, which returns a copy of the array with the null values replaced. Consider the Series data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde')):
data.fillna(0)
Output:
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
We can specify a forward-fill to propagate the previous value forward:
# forward-fill
data.fillna(method='ffill')
Output:
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
Or we can specify a back-fill to propagate the next values backward:
# back-fill
data.fillna(method='bfill')
Output:
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
For DataFrames, the options are similar, but we can also specify an axis along which
the fills take place:
df
Output:
df.fillna(method='ffill', axis=1)
Output:
If a previous value is not available during a forward fill, the NA value remains.
import pandas as pd
import numpy as np
14. Hierarchical Indexing
14.1 A Multiply Indexed Series
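The example follows the usual state-population illustration. The list of (state, year) tuples and the initial pop Series were not preserved here; a sketch consistent with the population values shown later in this unit is:
import pandas as pd
# (state, year) tuples used as the index
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)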
index = pd.MultiIndex.from_tuples(index)
index
Output:
The MultiIndex contains multiple levels of indexing–in this case, the state names and
the years, as well as multiple labels for each data point which encode these levels.
If we re-index our series with this MultiIndex, we see the hierarchical representation
of the data:
pop = pop.reindex(index)
pop
Output:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Here the first two columns of the Series representation show the multiple index
values, while the third column shows the data. Some entries are missing in the first
column: in this multi-index representation, any blank entry indicates the same value
as the line above it.
Now to access all data for which the second index is 2010, we can simply use the
Pandas slicing notation:
pop[:, 2010]
Output:
California 37253956
New York 19378102
Texas 25145561
dtype: int64
The result is a singly indexed array with just the keys we're interested in.
14.1.2 MultiIndex as extra dimension
The unstack() method will quickly convert a multiply indexed Series into a
conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
Output:
pop_df.stack()
Output:
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
Output:
The work of creating the MultiIndex is done in the background.
Similarly, if we pass a dictionary whose keys are the (state, year) tuples, for example data = {('California', 2000): 33871648, ('California', 2010): 37253956, ...}, Pandas will automatically recognize this and use a MultiIndex by default:
pd.Series(data)
Output:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
14.2.1 Explicit MultiIndex constructors
We can construct the MultiIndex from a simple list of arrays giving the index values
within each level:
Output:
We can construct it from a list of tuples giving the multiple index values of each
point:
Output:
Output:
Similarly, we can construct the MultiIndex directly using its internal encoding by
passing levels (a list of lists containing available index values for each level) and
labels (a list of lists that reference these labels):
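Hedged sketches of the constructor calls this subsection refers to; the element values are the standard small examples, not necessarily the ones originally shown:
import pandas as pd
# from a list of arrays, one array per level
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
# from a list of tuples giving the multiple index values of each point
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
# from the Cartesian product of single-level index values
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])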
Any of these objects can be passed as the index argument when creating a Series or
Dataframe, or be passed to the reindex method of an existing Series or DataFrame.
In a DataFrame, the rows and columns are completely symmetric, and just as the
rows can have multiple levels of indices, the columns can have multiple levels as
well.
Consider the following medical data, where both the rows (year, visit) and the columns (subject, type) carry multiple levels:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Output:
we can index the top-level column by the person's name and get a full DataFrame
containing just that person's information:
health_data['Guido']
Output:
For complicated records containing multiple labeled measurements across
multiple times for many subjects (people, countries, cities, etc.) use of
hierarchical rows and columns can be convenient.
14.3.1 Multiply indexed Series
Consider the multiply indexed Series of state populations we saw earlier. We can access single elements by indexing with multiple terms:
pop
Output:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
pop['California', 2000]
Output:
33871648
The MultiIndex also supports partial indexing, or indexing just one of the levels in
the index. The result is another Series, with the lower-level indices maintained:
pop['California']
Output:
year
2000 33871648
2010 37253956
dtype: int64
pop.loc['California':'New York']
Output:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:
pop[:, 2000]
Output:
state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
Other types of indexing and selection work as well, for example selection based on Boolean masks (such as pop[pop > 22000000]) and selection based on fancy indexing:
pop[['California', 'Texas']]
Output:
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype: int64
14.3.2 Multiply indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider the toy
medical DataFrame
health_data
Output:
Columns are primary in a DataFrame, and the syntax used for multiply indexed
Series applies to the columns. For example, we can recover Guido's heart rate data
with a simple operation:
health_data['Guido', 'HR']
Output:
year visit
2013 1 32.0
2 50.0
2014 1 39.0
2 48.0
Name: (Guido, HR), dtype: float64
Also, as with the single-index case, we can use the loc, iloc, and ix indexers:
health_data.iloc[:2, :2]
Output:
Each individual index in loc or iloc can also be passed a tuple of multiple indices, for example:
health_data.loc[:, ('Bob', 'HR')]
Output:
Working with slices within these index tuples is not especially convenient; instead, Pandas provides the IndexSlice object for exactly this situation:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
Output:
Many of the MultiIndex slicing operations will fail if the index is not sorted. Consider, for example, multiply indexed data whose outer index labels ('a', 'c', 'b') are not in lexicographic order:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]], names=['char', 'int'])
data = pd.Series(np.random.rand(6), index=index)
data
Output:
char int
a 1 0.003001
2 0.164974
c 1 0.741650
2 0.569264
b 1 0.001693
2 0.526226
dtype: float64
try:
data['a':'b']
except KeyError as e:
print(type(e))
print(e)
Output:
<class 'KeyError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'
Partial slices and other similar operations require the levels in the MultiIndex to be in
sorted (i.e., lexicographic) order. Pandas provides a number of convenience routines
to perform this type of sorting; examples are the sort_index() and sortlevel()
methods of the DataFrame
data = data.sort_index()
data
Output:
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
c 1 0.741650
2 0.569264
dtype: float64
With the index sorted in this way, partial slicing will work as expected:
data['a':'b']
Output:
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
dtype: float64
pop.unstack(level=1)
Output:
The opposite of unstack() is stack(), which here can be used to recover the original
series:
pop.unstack().stack()
Output:
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Another way to rearrange hierarchical data is index resetting: the reset_index method turns the index labels into columns of a flat DataFrame:
pop_flat = pop.reset_index(name='population')
pop_flat
Output:
The set_index method of the DataFrame performs the reverse, returning a multiply indexed DataFrame:
pop_flat.set_index(['state', 'year'])
Output:
Pandas has built-in data aggregation methods such as mean(), sum(), and max(). For
hierarchically indexed data, these can be passed a level parameter that controls which
subset of the data the aggregate is computed on.
health_data
Output:
data_mean = health_data.mean(level='year')
data_mean
Output:
Using the axis keyword, we can take the mean among levels on the columns as well:
data_mean.mean(axis=1, level='type')
Output:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
Output:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
The first argument is a list or tuple of arrays to concatenate. Additionally, it takes an axis keyword that allows us to specify the axis along which the result will be concatenated:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
Pandas has a function, pd.concat(), which has a syntax similar to np.concatenate but contains a number of additional options. Its signature in older versions of Pandas (on which these notes are based) is roughly:
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
By default, the concatenation takes place row-wise within the DataFrame (i.e.,
axis=0). Like np.concatenate, pd.concat allows specification of an axis along which
concatenation will take place. Consider the following example:
We could have equivalently specified axis=1; here we've used the more intuitive axis='columns'.
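The examples that follow use display() and make_df() helpers in the style of the Python Data Science Handbook; they are not part of Pandas. A minimal sketch of what they might look like:
import pandas as pd
def make_df(cols, ind):
    # build a DataFrame whose cells encode their column and row labels, e.g. 'A0'
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)
def display(*args):
    # evaluate each expression string (assumed to refer to module-level names)
    # and print it together with its result
    for a in args:
        print(a)
        print(eval(a), end='\n\n')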
15.1.1 Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas
concatenation preserves indices, even if the result will have duplicate indices.
Consider this example:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
Output:
To verify that the indices in the result of pd.concat() do not overlap, we can specify
the verify_integrity flag. With this set to True, the concatenation will raise an
exception if there are duplicate indices. Here is an example, where for clarity we'll
catch and print the error message:
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
Output:
ValueError: Indexes have overlapping values: [0, 1]
Sometimes the index itself does not matter and we would prefer it to simply be
ignored. This option can be specified using the ignore_index flag. With this set to
true, the concatenation will create a new integer index for the resulting Series:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
Output:
Another option is to use the keys option to specify a label for the data sources; the
result will be a hierarchically indexed series containing the data:
Output:
Output:
By default, the entries for which no data is available are filled with NA values. To
change this, we can specify one of several options for the join and join_axes
parameters of the concatenate function. By default, the join is a union of the input
columns (join='outer'), but we can change this to an intersection of the columns
using join='inner':
display('df5', 'df6',
Output:
Another option is to directly specify the index of the remaining columns using the
join_axes argument, which takes a list of index objects. Here we will specify that the
returned columns should be the same as those of the first input:
display('df5', 'df6',
"pd.concat([df5, df6], join_axes=[df5.columns])")
Output:
Unlike the append() and extend() methods of Python lists, the append() method in
Pandas does not modify the original object, instead it creates a new object with the
combined data. It also is not a very efficient method, because it involves creation of
a new index and data buffer. Thus, if we want to do multiple append operations, it is
better to build a list of DataFrames and pass them all at once to the concat()
function.
16. Combining Datasets: Merge and Join
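The one-to-one join example below uses two small DataFrames whose contents were not preserved here; this is the customary employee/group illustration, so the names and columns are assumptions:
import pandas as pd
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})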
display('df1', 'df2')
Output:
To combine this information into a single DataFrame, we can use the pd.merge()
function:
Output:
df3 = pd.merge(df1, df2)
df3
Output:
The order of entries in each column is not necessarily maintained: in this case, the
order of the "employee" column differs between df1 and df2 and the pd.merge()
function correctly accounts for this.
Many-to-one joins are joins in which one of the two key columns contains duplicate
entries. For the many-to-one case, the resulting DataFrame will preserve those
duplicate entries as appropriate. Consider the following example of a many-to-one
join:
Output:
The resulting DataFrame has an additional column with the "supervisor" information,
where the information is repeated in one or more locations as required by the inputs.
If the key column in both the left and right array contains duplicates, then the result
is a many-to-many merge.
Consider the following, where we have a DataFrame showing one or more skills
associated with a particular group. By performing a many-to-many join, we can
recover the skills associated with any individual person:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
pd.merge(df1, df5)
Output:
16.2 Specification of the Merge Key
pd.merge() looks for one or more matching column names between the two inputs
and uses this as the key. However, often the column names will not match and
pd.merge() provides a variety of options for handling this.
We can explicitly specify the name of the key column using the on keyword, which
takes a column name or a list of column names:
Output:
This option works only if both the left and right DataFrames have the specified
column name.
At times we may want to merge two datasets with different column names. For
example, we may have a dataset in which the employee name is labeled as "name"
rather than "employee". In this case, we can use the left_on and right_on keywords
to specify the two column names:
Output:
We can use the index as the key for merging by specifying the left_index and/or
right_index flags in pd.merge():
display('df1a', 'df2a',
"pd.merge(df1a, df2a, left_index=True, right_index=True)")
Output:
If we want to mix indices and columns, we can combine left_index with right_on or
left_on with right_index to get the desired behavior:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
Output:
By default, pd.merge() performs an inner join (how='inner'), keeping only the keys present in both inputs. Other options for the how keyword are 'outer', 'left', and 'right'. An outer join returns a join over the union of the input columns, and fills in all missing values with NAs:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
Output:
The left join and right join return joins over the left entries and right entries,
respectively. For example:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
Output:
The output rows correspond to the entries in the left input. Using how='right' works
in a similar manner.
Here the two input DataFrames have conflicting column names. Consider this
example:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
Output:
Because the output would have two conflicting column names, the merge function
automatically appends a suffix _x or _y to make the output columns unique. If these
defaults are inappropriate, it is possible to specify a custom suffix using the suffixes
keyword:
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')
Output:
These suffixes work in any of the possible join patterns and work also if there are
multiple overlapping columns.
Planets Data
Here we will use the Planets dataset, available via the Seaborn package. It gives
information on planets that astronomers have discovered around other stars.
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Output:
(1035, 6)
planets.head()
Output:
This has some details on the 1,000+ extrasolar planets discovered up to 2014.
17.1 Simple Aggregation in Pandas
For a Pandas Series, the aggregates return a single value:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
Output:
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
ser.sum()
Output:
2.8119254917081569
ser.mean()
Output:
0.56238509834163142
For a DataFrame, by default the aggregates return results within each column:
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
Output:
df.mean()
Output:
A 0.477888
B 0.443420
dtype: float64
By specifying the axis argument, we can instead aggregate within each row:
df.mean(axis='columns')
Output:
0 0.088290
1 0.513997
2 0.849309
3 0.406727
4 0.444949
dtype: float64
The method describe() computes several common aggregates for each column and returns the result. We can use it on the Planets data, after first dropping rows with missing values:
planets.dropna().describe()
Output:
The most basic split-apply-combine operation can be computed with the groupby()
method of DataFrames, passing the name of the desired key column:
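The df used in this first call was not preserved; a minimal sketch consistent with the discussion (a 'key' column with repeating labels) is:
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])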
df.groupby('key')
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x117272160>
what is returned is not a set of DataFrames, but a DataFrameGroupBy object.
● Column indexing
The GroupBy object supports column indexing in the same way as the DataFrame
and returns a modified GroupBy object. For example:
planets.groupby('method')
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
planets.groupby('method')['orbital_period']
Output:
<pandas.core.groupby.SeriesGroupBy object at 0x117272da0>
Here we have selected a particular Series group from the original DataFrame group
by reference to its column name. As with the GroupBy object, no computation is
done until we call some aggregate on the object:
planets.groupby('method')['orbital_period'].median()
Output:
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
Name: orbital_period, dtype: float64
This gives an idea of the general scale of orbital periods (in days)
● Iteration over groups
The GroupBy object supports direct iteration over the groups, returning each group
as a Series or DataFrame:
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
Output:
Astrometry shape=(2, 6)
Eclipse Timing Variations shape=(9, 6)
Imaging shape=(38, 6)
Microlensing shape=(23, 6)
Orbital Brightness Modulation shape=(3, 6)
Pulsar Timing shape=(5, 6)
Pulsation Timing Variations shape=(1, 6)
Radial Velocity shape=(553, 6)
Transit shape=(397, 6)
Transit Timing Variations shape=(4, 6)
● Dispatch methods
Through some Python class magic, any method not explicitly implemented by the
GroupBy object will be passed through and called on the groups, whether they are
DataFrame or Series objects. For example, we can use the describe() method of
DataFrames to perform a set of aggregations that describe each group in the data:
planets.groupby('method')['year'].describe().unstack()
Output:
17.2.2 Aggregate, filter, transform, apply
GroupBy objects have aggregate(), filter(), transform(), and apply() methods that
efficiently implement a variety of useful operations before combining the grouped
data.
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
Output:
● Aggregation
aggregate() method can take a string, a function, or a list and compute all the
aggregates at once.
df.groupby('key').aggregate(['min', np.median, max])
Output:
Another useful pattern is to pass a dictionary mapping column names to operations
to be applied on that column:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
Output:
● Filtering
A filtering operation allows us to drop data based on the group properties. For
example, we might want to keep all groups in which the standard deviation is larger
than some critical value:
def filter_func(x):
return x['data2'].std() > 4
The filter function should return a Boolean value specifying whether the group
passes the filtering. Here because group A does not have a standard deviation
greater than 4, it is dropped from the result.
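A usage sketch of applying the filter to the df defined above:
df.groupby('key').filter(filter_func)   # keeps only groups whose data2 std is above 4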
● Transformation
While aggregation must return a reduced version of the data, transformation can
return some transformed version of the full data to recombine. For such a
transformation, the output is the same shape as the input. A common example is to
center the data by subtracting the group-wise mean:
df.groupby('key').transform(lambda x: x - x.mean())
Output:
● Apply
The apply() method lets us apply an arbitrary function to the group results. The function should take a DataFrame and return either a Pandas object or a scalar. For example, here we normalize the first data column by the sum of the second:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
display('df', "df.groupby('key').apply(norm_by_data2)")
Output:
17.2.3 Specifying the split key
● A list, array, series, or index providing the grouping keys
The key can be any series or list with a length matching that of the DataFrame. For
example:
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')
Output:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')
Output:
18. Pivot Tables
• The examples in this section use the Titanic passenger data, loaded through Seaborn as titanic = sns.load_dataset('titanic').
• Grouping survival by gender immediately gives us some insight: overall, three of every four females on board survived, while only one in five males survived!
• Using the vocabulary of GroupBy, we might proceed using something like this: we group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack the hierarchical index to reveal the hidden multidimensionality. In code:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
• This gives us a better idea of how both gender and class affected survival.
• This two-dimensional GroupBy is common enough that Pandas includes a
convenience routine, pivot_table, which succinctly handles this type of
multi-dimensional aggregation.
Here is the equivalent to the preceding operation using the pivot_table method of
DataFrames:
titanic.pivot_table('survived', index='sex', columns='class')
This is eminently more readable than the groupby approach, and produces the same
result.
We can add further dimensions by binning the age and fare:
age = pd.cut(titanic['age'], [0, 18, 80])
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
Let's take a look at the freely available data on births in the United States, provided
by the Centers for Disease Control (CDC). This data can be found at
https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
# !curl -O
https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('data/births.csv')
Taking a look at the data, we see that it's relatively simple–it contains the number of
births grouped by date and gender:
births.head()
We can start to understand this data a bit more by using a pivot table. Let's add a
decade column, and take a look at male and female births as a function of decade:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
We immediately see that male births outnumber female births in every decade. To
see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to
visualize the total number of births by year
%matplotlib inline
import matplotlib.pyplot as plt
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year')
With a simple pivot table and the plot() method, we can immediately see the annual trend in births by gender. By eye, it appears that over the past 50 years male births have outnumbered female births by around 5%.
19. Vectorized String Operations
Pandas provides vectorized string operations through the str attribute of Series. For example, given a Series of unevenly capitalized names with a missing entry, names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO']), we can call a single method that will capitalize all the entries while skipping over any missing values:
names.str.capitalize()
Output:
0 Peter
1 Paul
2 None
3 Mary
4 Guido
dtype: object
Some methods return Boolean values. Using a Series of Monty Python member names, monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin']):
monte.str.startswith('T')
Output:
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
Still others return lists or other compound values for each element:
monte.str.split()
Output:
0 [Graham, Chapman]
1 [John, Cleese]
2 [Terry, Gilliam]
3 [Eric, Idle]
4 [Terry, Jones]
5 [Michael, Palin]
dtype: object
19.2.2 Methods using regular expressions
There are several methods that accept regular expressions to examine the content
of each string element, and follow some of the API conventions of Python's built-in
re module:
we can extract the first name from each by asking for a contiguous group of
characters at the beginning of each element:
monte.str.extract('([A-Za-z]+)', expand=False)
Output:
0 Graham
1 John
2 Terry
3 Eric
4 Terry
5 Michael
dtype: object
we can find all names that start and end with a consonant, making use of the
start-of-string (^) and end-of-string ($) regular expression characters:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Output:
0 [Graham Chapman]
1 []
2 [Terry Gilliam]
3 []
4 [Terry Jones]
5 [Michael Palin]
dtype: object
To get the last name of each member, we can combine split() with element access via get():
monte.str.split().str.get(-1)
Output:
0 Chapman
1 Cleese
2 Gilliam
3 Idle
4 Jones
5 Palin
dtype: object
● Indicator variables
The get_dummies() method is useful when our data has a column containing some
sort of coded indicator. For example, we might have a dataset that contains
information in the form of codes, such as A="born in America," B="born in the
United Kingdom," C="likes cheese," D="likes spam":
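A sketch of the indicator-variable example; the full_monte frame and its info codes follow the customary illustration and are assumptions here:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})
# split the coded column into one 0/1 indicator column per code letter
full_monte['info'].str.get_dummies('|')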
20. Working with Time Series
Date and time data comes in several flavors:
● Time stamps reference particular moments in time (e.g., July 4th, 2015 at
7:00am).
● Time intervals and periods reference a length of time between a particular
beginning and end point; for example, the year 2015. Periods usually
reference a special case of time intervals in which each interval is of uniform
length and does not overlap (e.g., 24 hour-long periods comprising days).
● Time deltas or durations reference an exact length of time (e.g., a duration
of 22.56 seconds).
While the time series tools provided by Pandas tend to be the most useful for data
science applications, it is helpful to see their relationship to other packages used in
Python.
● Native Python dates and times: datetime and dateutil
Python's basic objects for working with dates and times reside in the built-in
datetime module. Along with the third-party dateutil module, we can use it to quickly
perform a host of useful functionalities on dates and times. For example, we can
manually build a date using the datetime type:
from datetime import datetime
datetime(year=2015, month=7, day=4)
Output:
datetime.datetime(2015, 7, 4, 0, 0)
Or, using the dateutil module, we can parse dates from a variety of string formats:
from dateutil import parser
date = parser.parse("4th of July, 2015")
date
Output:
datetime.datetime(2015, 7, 4, 0, 0)
Once we have a datetime object, we can print the day of the week:
date.strftime('%A')
Output:
'Saturday'
("%A") is one of the standard string format codes for printing dates.
The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of
dates to be represented very compactly. The datetime64 requires a very specific
input format:
import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date
Output:
array(datetime.date(2015, 7, 4), dtype='datetime64[D]')
• Once we have this date formatted, we can do vectorized operations on it:
date + np.arange(12)
Output:
array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
'2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
'2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'], dtype='datetime64[D]')
NumPy will infer the desired unit from the input; for example, here is a day-based
datetime:
np.datetime64('2015-07-04')
Output:
numpy.datetime64('2015-07-04')
Here is a minute-based datetime:
np.datetime64('2015-07-04 12:00')
Output:
numpy.datetime64('2015-07-04T12:00')
The time zone is automatically set to the local time on the computer executing the
code. We can force any desired fundamental unit using one of many format codes;
for example, here we'll force a nanosecond-based time:
np.datetime64('2015-07-04 12:59:59.50', 'ns')
Output:
numpy.datetime64('2015-07-04T12:59:59.500000000')
The following table lists the available format codes along with the relative and
absolute time spans that they can encode:
For the types of data we see in the real world, a useful default is datetime64[ns], as
it can encode a useful range of modern dates with a suitably fine precision.
Pandas builds upon all the tools to provide a Timestamp object, which combines the
ease-of-use of datetime and dateutil with the efficient storage and vectorized
interface of numpy.datetime64. From a group of these Timestamp objects, Pandas
can construct a DatetimeIndex that can be used to index data in a Series or
DataFrame.
We can parse a flexibly formatted string date and use format codes to output the
day of the week:
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date
Output:
Timestamp('2015-07-04 00:00:00')
date.strftime('%A')
Output:
'Saturday'
With the Pandas time series tools, we can index data by timestamps. For example,
we can construct a Series object that has time indexed data:
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
'2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
Output:
2014-07-04 0
2014-08-04 1
2015-07-04 2
2015-08-04 3
dtype: int64
Now that we have this data in a Series, we can make use of any of the Series
indexing patterns.
data['2014-07-04':'2015-07-04']
Output:
2014-07-04 0
2014-08-04 1
2015-07-04 2
dtype: int64
There are additional special date-only indexing operations, such as passing a year
to obtain a slice of all data from that year:
data['2015']
Output:
2015-07-04 2
2015-08-04 3
dtype: int64
The fundamental Pandas data structures for working with time series data:
● For time stamps, Pandas provides the Timestamp type. It is essentially a
replacement for Python's native datetime, but is based on the more efficient
numpy.datetime64 data type. The associated Index structure is
DatetimeIndex.
● For time Periods, Pandas provides the Period type. This encodes a
fixed-frequency interval based on numpy.datetime64. The associated index
structure is PeriodIndex.
● For time deltas or durations, Pandas provides the Timedelta type. Timedelta is
a more efficient replacement for Python's native datetime.timedelta type and
is based on numpy.timedelta64. The associated index structure is
TimedeltaIndex.
The most fundamental of these date/time objects are the Timestamp and
DatetimeIndex objects.
While these class objects can be invoked directly, it is more common to use the
pd.to_datetime() function which can parse a wide variety of formats.
Passing a single date to pd.to_datetime() yields a Timestamp, passing a series of
dates by default yields a DatetimeIndex:
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
'2015-Jul-6', '07-07-2015', '20150708'])
dates
Output:
pd.date_range('2015-07-03', '2015-07-10')
Output:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
Alternatively, the date range can be specified not with a start and endpoint, but with
a startpoint and a number of periods:
pd.date_range('2015-07-03', periods=8)
Output:
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
'2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
dtype='datetime64[ns]', freq='D')
The spacing can be modified by altering the freq argument which defaults to D. For
example, here we will construct a range of hourly timestamps:
pd.date_range('2015-07-03', periods=8, freq='H')
Output:
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
'2015-07-03 02:00:00', '2015-07-03 03:00:00',
'2015-07-03 04:00:00', '2015-07-03 05:00:00',
'2015-07-03 06:00:00', '2015-07-03 07:00:00'],
dtype='datetime64[ns]', freq='H')
Fundamental to these Pandas time series tools is the concept of a frequency or date
offset. Just as we saw the D (day) and H (hour) codes above, similar codes can be used
to specify any desired frequency spacing; the main ones include W (weekly), M (month
end), Q (quarter end), A (year end), B (business day), T (minutes), and S (seconds).
The monthly, quarterly, and annual frequencies are all marked at the end of the
specified period. By adding an S suffix to any of these, they instead will be marked
at the beginning (for example, MS for month start and QS for quarter start).
Additionally, the month used to mark any quarterly or annual code can be changed by
adding a three-letter month code as a suffix (for example, Q-JAN or A-FEB). In the
same way, the split-point of the weekly frequency can be modified by adding a
three-letter weekday code (for example, W-SUN or W-WED).
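For instance, a minimal sketch of these codes and suffixes with pd.date_range:
pd.date_range('2015-01-01', periods=3, freq='M')      # month ends: Jan 31, Feb 28, Mar 31
pd.date_range('2015-01-01', periods=3, freq='MS')     # month starts: Jan 1, Feb 1, Mar 1
pd.date_range('2015-01-01', periods=3, freq='Q-JAN')  # quarters of a year ending in January
pd.date_range('2015-07-01', periods=3, freq='W-WED')  # weekly, anchored on Wednesdays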
On top of this, codes can be combined with numbers to specify other frequencies.
For example, for a frequency of 2 hours 30 minutes, we can combine the hour (H)
and minute (T) codes as follows:
pd.timedelta_range(0, periods=9, freq="2H30T")
Output:
TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00',
'12:30:00', '15:00:00', '17:30:00', '20:00:00'],
dtype='timedelta64[ns]', freq='150T')
All of these short codes refer to specific instances of Pandas time series offsets,
which can be found in the pd.tseries.offsets module. For example, we can create a
business day offset directly as follows:
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
Output:
DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
'2015-07-07'],
dtype='datetime64[ns]', freq='B')
The ability to use dates and times as indices to intuitively organize and access data is
an important piece of the Pandas time series tools. The benefits of indexed data in
general (automatic alignment during operations, intuitive data slicing and access,
etc.) still apply and Pandas provides several additional time series-specific
operations.
One common need for time series data is resampling at a higher or lower
frequency. This can be done using the resample() method or the much simpler
asfreq() method. The primary difference between the two is that resample() is
fundamentally a data aggregation, while asfreq() is fundamentally a data
selection.
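In the example below, goog is assumed to be a Series of Google closing prices indexed
by date, loaded earlier (for example with pandas-datareader). First, a small
self-contained sketch on synthetic data makes the aggregation-versus-selection
distinction concrete:
import numpy as np
import pandas as pd
ts = pd.Series(np.arange(10),
               index=pd.date_range('2015-01-01', periods=10, freq='D'))
ts.resample('3D').mean()   # aggregation: mean of each 3-day bin -> 1.0, 4.0, 7.0, 9.0
ts.asfreq('3D')            # selection: the value observed every 3rd day -> 0, 3, 6, 9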
Taking the Google closing price, let's compare what the two return when we
down-sample the data. Here we will resample the data at the end of business
year:
import matplotlib.pyplot as plt
goog.plot(alpha=0.5, style='-')
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--')
plt.legend(['input', 'resample', 'asfreq'], loc='upper left');
Output:
At each point, resample reports the average of the previous year, while asfreq reports
the value at the end of the year.
For up-sampling, resample() and asfreq() are largely equivalent. In this case, the
default for both methods is to leave the up-sampled points empty, that is, filled with
NA values. Just as with the pd.fillna() function, asfreq() accepts a method argument
to specify how values are imputed. Here, we will resample the business day data at
a daily frequency (i.e., including weekends):
fig, ax = plt.subplots(2, sharex=True)
data = goog.iloc[:10]
data.asfreq('D').plot(ax=ax[0], marker='o')
data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);
Output:
The top panel is the default: non-business days are left as NA values and do not
appear on the plot.
The bottom panel shows the differences between two strategies for filling the gaps:
forward-filling and backward-filling.
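A minimal self-contained sketch of these fill strategies (on a hypothetical
three-point series with a gap):
ts = pd.Series([1, 2, 3],
               index=pd.to_datetime(['2015-07-01', '2015-07-02', '2015-07-06']))
ts.asfreq('D')                   # gap days are left as NaN
ts.asfreq('D', method='ffill')   # forward-fill: propagate the last valid value
ts.asfreq('D', method='bfill')   # back-fill: use the next valid value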
20.6.2 Time-shifts
Another common time series-specific operation is shifting of data in time. Pandas has
two closely related methods for computing this: shift() and tshift(). The difference
between them is that shift() shifts the data, while tshift() shifts the index. In both
cases, the shift is specified in multiples of the frequency.
fig, ax = plt.subplots(3, sharey=True)
goog = goog.asfreq('D', method='pad')
goog.plot(ax=ax[0])
goog.shift(900).plot(ax=ax[1])
goog.tshift(900).plot(ax=ax[2])
local_max = pd.to_datetime('2007-11-05')
offset = pd.Timedelta(900, 'D')
ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')
ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')
ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[2].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red')
shift(900) shifts the data by 900 days, pushing some of it off the end of the graph
(and leaving NA values at the other end), while tshift(900) shifts the index values by
900 days.
A common context for this type of shift is in computing differences over time. For
example, we use shifted values to compute the one-year return on investment for
Google stock over the course of the dataset:
ROI = 100 * (goog.tshift(-365) / goog - 1)
ROI.plot()
plt.ylabel('% Return on Investment');
Output:
This helps us to see the overall trend in Google stock.
20.6.3 Rolling windows
Rolling statistics are a third type of time series-specific operation implemented by
Pandas. These can be accomplished via the rolling() attribute of Series and
DataFrame objects, which returns a view similar to what we saw with the groupby
operation. This rolling view makes available a number of aggregation operations by
default.
For example, the one-year centered rolling mean and standard deviation of the
Google stock prices:
rolling = goog.rolling(365, center=True)
data = pd.DataFrame({'input': goog,
'one-year rolling_mean': rolling.mean(),
'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)
Output:
As with group-by operations, the aggregate() and apply() methods can be used for
custom rolling computations.
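For example, a minimal sketch of a custom rolling computation (assuming the same goog
series as above), here the peak-to-peak range over a 30-day window:
rolling = goog.rolling(30, center=True)
rolling.apply(lambda window: window.max() - window.min())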
The power of the PyData stack is built upon the ability of NumPy and Pandas to push
basic operations into C via an intuitive syntax: examples are vectorized/broadcasted
operations in NumPy, and grouping-type operations in Pandas. While these
abstractions are efficient and effective for many common use cases, they often rely
on the creation of temporary intermediate objects, which can cause undue overhead
in computational time and memory use.
Pandas includes some experimental tools that allow us to directly access C-speed
operations without costly allocation of intermediate arrays. These are the eval() and
query() functions which rely on the Numexpr package.
NumPy and Pandas support fast vectorized operations; for example, when adding
the elements of two arrays:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y
This is much faster than doing the addition via a Python loop or comprehension:
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)),
                    dtype=x.dtype, count=len(x))
But this abstraction can become less efficient when computing compound expressions,
because each intermediate step allocates a full temporary array. The Pandas pd.eval()
function uses string expressions to compute such operations efficiently on
DataFrames. For example, consider the following four DataFrames:
import numexpr
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
for i in range(4))
To compute the sum of all four DataFrames using the typical Pandas approach, we
can just write the sum:
%timeit df1 + df2 + df3 + df4
The same result can be computed via pd.eval() by constructing the expression as a
string, which is typically faster and uses much less memory:
%timeit pd.eval('df1 + df2 + df3 + df4')
pd.eval() supports a wide range of operations, including:
● Arithmetic and comparison operators
● Bitwise operators (& and |), as well as the literals and and or in Boolean
expressions
● Access to object attributes via the obj.attr syntax and indexes via the
obj[index] syntax
Other operations such as function calls, conditional statements, loops, and other
more involved constructs are currently not implemented in pd.eval(). If we want to
execute these more complicated types of expressions, we can use the Numexpr
library itself.
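A minimal sketch of calling Numexpr directly to evaluate such a compound expression
(reusing the x and y arrays created above):
import numexpr
mask_numpy = (x > 0.5) & (y < 0.5)
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask_numpy, mask_numexpr)   # True: same result, without full-sized temporaries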
Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method
that works in similar ways. The benefit of the eval() method is that columns can be
referred to by name. We'll use this labeled array as an example:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
Output:
Using pd.eval() as above, we can compute expressions with the three columns like
this:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)
Output:
True
The DataFrame.eval() method allows much more succinct evaluation of expressions
with the columns:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)
Output:
True
Notice that here we treat column names as variables within the evaluated expression.
We can use df.eval() to create a new column 'D' and assign to it a value computed
from the other columns:
df.eval('D = (A + B) / C', inplace=True)
df.head()
Output:
When considering whether to use these functions, there are two considerations:
computation time and memory use. Memory use is the most predictable aspect.
Every compound expression involving NumPy arrays or Pandas DataFrames will
result in the implicit creation of temporary arrays. For example, this:
x = df[(df.A < 0.5) & (df.B < 0.5)]
Is roughly equivalent to this:
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]
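The query() method mentioned earlier expresses this same kind of filtering as a string
evaluated by Numexpr, avoiding the full-sized temporary arrays; a minimal sketch:
result_masking = df[(df.A < 0.5) & (df.B < 0.5)]
result_query = df.query('A < 0.5 and B < 0.5')
result_masking.equals(result_query)   # True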
Video links:
2. Computation on Arrays – https://www.youtube.com/watch?v=QD6IBF0Hic4
3. Indexing – https://www.youtube.com/watch?v=WpXH4PzDtYA
3. Using time series data, visualize the Seattle Bicycle Counts. (CO2, K3)
11. PART A : Q & A : UNIT – II
1. How can a Series object be modified? (CO2, K3)
Series objects can be modified with a dictionary-like syntax. Just as we can extend a
dictionary by assigning to a new key, we can extend a Series by assigning to a new
index value.
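For example, a minimal sketch with a small hypothetical Series:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75], index=['a', 'b', 'c'])
data['d'] = 1.0    # assigning to a new index value extends the Series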
The first sentinel value used by Pandas is None, a Python singleton object that is
often used for missing data in Python code. Because it is a Python object, None
cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data
type 'object' i.e. arrays of Python objects.
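For example, a minimal sketch:
import numpy as np
vals = np.array([1, None, 3, 4])
vals.dtype    # dtype('O'): the array falls back to Python objects because of None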
The describe() method computes several common aggregates for each column and
returns the result. We can use it on a dataset after dropping rows with missing
values to get a quick summary of the remaining data.
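A minimal sketch on a hypothetical DataFrame with missing values:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, np.nan]})
df.dropna().describe()   # summary statistics for the rows without missing values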
6. What is split, apply and combine? (CO2,K3)
● The split step involves breaking up and grouping a data frame depending on
the value of the specified key.
● The apply step involves computing some function, usually an aggregation,
transformation, or filtering, within the individual groups.
● The combine step merges the results of these operations into an output
array.
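A minimal sketch of split-apply-combine using Pandas groupby on hypothetical toy data:
import pandas as pd
df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})
df.groupby('key').sum()   # split on 'key', apply sum within each group, combine the results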
Numexpr evaluates the expression in a way that does not use full-sized temporary
arrays and can be much more efficient than NumPy, especially for large arrays. The
Pandas eval() and query() tools are conceptually similar and depend on the
Numexpr package.
12. PART B QUESTIONS : UNIT – II
1. Explain the creation of multidimensional arrays with examples in NumPy. (CO2, K3)
2. Write short notes on the following (CO2, K2)
i. Indexing of Ndarrays
NPTEL : https://onlinecourses.nptel.ac.in/noc21_cs69/preview?
coursera : https://www.coursera.org/learn/python-data-analysis
Udemy : https://www.udemy.com/topic/data-science/
Mooc : https://mooc.es/course/introduction-to-data-science-in-python/
edx : https://learning.edx.org/course/course-v1:Microsoft+DAT208x+2T2016/home
14. REAL TIME APPLICATIONS
3. Stock Prediction: The stock market is extremely volatile. However, that doesn't
mean it cannot be modelled. With the help of Pandas and a few other libraries like
NumPy and matplotlib, we can build models that predict how stock prices may move.
This is possible because a large amount of historical stock data is available,
describing how prices have behaved. By learning from these data, a model can predict
the next move with some accuracy. Beyond this, people can also automate the buying
and selling of stocks with the help of such prediction models.
4. Neuroscience: Understanding the nervous system has long been a goal of humankind,
because many questions about how our bodies work remain unsolved. Machine learning
has helped this field immensely through the various applications of Pandas. Its data
manipulation capabilities have played a major role in compiling the huge amounts of
data that help neuroscientists understand trends inside our bodies and the effects of
various factors on the entire nervous system.
5. Statistics: Pure mathematics itself has made much progress through the various
applications of Pandas. Since statistics deals with a lot of data, a library like
Pandas that specialises in data handling helps in many different ways. The mean,
median, and mode functions are only the most basic ones used in statistical
calculations; Pandas also supports many more complex statistical functions and plays
a large role in producing accurate results.
6. Advertising: Advertising has taken a huge leap in the 21st century. Nowadays
advertising is highly personalised, which helps companies attract more and more
customers. This has become possible thanks to machine learning and deep learning.
Models that go through customer data learn to understand exactly what the customer
wants, providing companies with effective advertisement ideas. Pandas has many
applications here: customer data is often prepared with the help of this library,
and many of the functions present in Pandas also help.
15. CONTENTS BEYOND SYLLABUS : UNIT – II
Distplots
Distplot stands for distribution plot; it takes an array as input and plots a curve
corresponding to the distribution of the points in the array.
Plotting a Distplot
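A minimal sketch of plotting a distplot with seaborn on hypothetical random data
(note that sns.distplot() is deprecated in recent seaborn releases in favour of
sns.displot() or sns.histplot()):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = np.random.normal(size=1000)
sns.distplot(data)   # histogram plus kernel density estimate of the distribution
plt.show()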
E-Book links:
1. https://drive.google.com/file/d/1HoGVyZqLTQj0aA4THA__D4jJ74czxEKH/view?usp=sharing
2. https://drive.google.com/file/d/1vJfX5xipCHZOleWfM9aUeK8mwsal6Il1/view?usp=sharing
3. https://drive.google.com/file/d/1aU2UKdLxLdGpmI73S1bifK8JPiMXlpoS/view?usp=sharing
18. MINI PROJECT SUGGESTION
Mini Projects
a) Recommendation system
b) Credit Card Fraud Detection
c) Fake News Detection
d) Customer Segmentation
e) Sentiment Analysis
f) Recommender Systems
g) Emotion Recognition
h) Stock Market Prediction
i) Email classification
j) Tweets classification
k) Uber Data Analysis
l) Social Network Analysis
Thank you