
Data Manipulation with Pandas

Introducing Pandas Objects


• Pandas objects are enhanced versions of NumPy structured arrays in which
the rows and columns are identified with labels rather than simple integer
indices.
• Pandas provides a host of useful tools, methods, and functionality on top of
the basic data structures.
• Three fundamental Pandas data structures:
1. Series, 2. DataFrame, and 3. Index
In[1]: import numpy as np
import pandas as pd
1. The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from
a list or array as follows:
In[2]: data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Out[2]: 0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64

The Series wraps both a sequence of values and a sequence of indices, which we can
access with the values and index attributes. The values are simply a familiar
NumPy array:
In[3]: data.values
Out[3]: array([ 0.25, 0.5 , 0.75, 1. ])

The index is an array-like object of type pd.Index:


In[4]: data.index
Out[4]: RangeIndex(start=0, stop=4, step=1)

Data can be accessed by the associated index via the familiar Python square-bracket notation:
In[5]: data[1]
Out[5]: 0.5
In[6]: data[1:3]
Out[6]: 1 0.50
2 0.75
dtype: float64

The Pandas Series is much more general and flexible than the one-dimensional
NumPy array that it emulates.
Series as generalized NumPy array
A Series object is essentially interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of any
desired type; if we wish, we can use strings as an index:
In[7]: data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Out[7]: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64

And the item access works as expected:


In[8]: data['b']
Out[8]: 0.5

We can even use noncontiguous or nonsequential indices:


In[9]: data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
Out[9]: 2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
In[10]: data[5]
Out[10]: 0.5

Series as specialized dictionary


A Pandas Series is like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a
Series is a structure that maps typed keys to a set of typed values.
We can make the Series-as-dictionary analogy even more clear by constructing a
Series object directly from a Python dictionary:
In[11]: population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
Out[11]: California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64

By default, a Series will be created where the index is drawn from the dictionary keys
(sorted, in older versions of pandas as shown here; modern pandas preserves insertion order).
From here, typical dictionary-style item access can be performed:
In[12]: population['California']
Out[12]: 38332521

Series also supports array-style operations such as slicing:


In[13]: population['California':'Illinois']
Out[13]: California 38332521
Florida 19552860
Illinois 12882135
dtype: int64

Constructing Series objects


We’ve already seen a few ways of constructing a Pandas Series from scratch; all
of them are some version of the following:
>>> pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.
For example, data can be a list or NumPy array, in which case index defaults to
an integer sequence:
In[14]: pd.Series([2, 4, 6])
Out[14]: 0 2
1 4
2 6
dtype: int64

Data can be a scalar, which is repeated to fill the specified index:


In[15]: pd.Series(5, index=[100, 200, 300])
Out[15]: 100 5
200 5
300 5
dtype: int64

data can be a dictionary, in which index defaults to the sorted dictionary keys:
In[16]: pd.Series({2:'a', 1:'b', 3:'c'})
Out[16]: 1 b
2 a
3 c
dtype: object
The index can be explicitly set if a different result is preferred:
In[17]: pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
Out[17]: 3 c
2 a
dtype: object

Notice that in this case, the Series is populated only with the explicitly identified
keys.
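The reverse situation also holds: if the explicit index contains keys that are absent from the dictionary, the corresponding entries are filled with NaN. A minimal sketch (the values here are arbitrary):

```python
import pandas as pd

# 'c' appears in the index but not in the dictionary, so it becomes NaN
# (and the dtype is upcast to float64 to accommodate the missing value).
s = pd.Series({'a': 1, 'b': 2}, index=['a', 'b', 'c'])
print(s)
```

This mirrors the behavior above: the index argument always determines which entries appear in the result.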

2. The Pandas DataFrame Object


The DataFrame can be thought of either as a generalization of a NumPy array, or as
a specialization of a Python dictionary.

DataFrame as a generalized NumPy array


If a Series is an analog of a one-dimensional array with flexible indices, a
DataFrame is an analog of a two-dimensional array with both flexible row indices
and flexible column names.
A DataFrame can also be thought of as a sequence of aligned Series objects; here,
"aligned" means they share the same index. To demonstrate, here is a new Series
listing the area of each of the five states discussed in the previous section:
In[18]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)
area
Out[18]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64

Now that we have this along with the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:
In[19]: states = pd.DataFrame({'population': population,
'area': area})
states
Out[19]: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

Like the Series object, the DataFrame has an index attribute that gives access to
the index labels:
In[20]: states.index
Out[20]:
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object


holding the column labels:
In[21]: states.columns
Out[21]: Index(['area', 'population'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional
NumPy array, where both the rows and columns have a generalized index for accessing
the data.

DataFrame as specialized dictionary


Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series
of column data. For example, asking for the 'area' attribute returns the Series
object containing the areas we saw earlier:
In[22]: states['area']
Out[22]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

Constructing DataFrame objects


A Pandas DataFrame can be constructed in a variety of ways.

From a single Series object. A DataFrame is a collection of Series objects,


and a single-column DataFrame can be constructed from a single Series:
In[23]: pd.DataFrame(population, columns=['population'])
Out[23]: population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193

From a list of dicts. Any list of dictionaries can be made into a DataFrame.
Use a simple list comprehension to create some data:
In[24]: data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Out[24]: a b
0 0 0
1 1 2
2 2 4
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN
(i.e., “not a number”) values:
In[25]: pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
Out[25]: a b c
0 1.0 2 NaN
1 NaN 3 4.0
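If NaN placeholders are not what you want, the missing entries can be replaced after construction. As a small sketch, the standard fillna method returns a new DataFrame with NaN values substituted:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# fillna returns a new frame; the original df keeps its NaN entries.
filled = df.fillna(0)
print(filled)
```

Note that columns containing NaN were upcast to float64 at construction time, so the filled values remain floats.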

From a dictionary of Series objects. A DataFrame can be constructed from
a dictionary of Series objects as well:
In[26]: pd.DataFrame({'population': population,
'area': area})
Out[26]: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

From a two-dimensional NumPy array. Given a two-dimensional array of


data, we can create a DataFrame with any specified column and index names. If
omitted, an integer index will be used for each:
In[27]: pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])

Out[27]: foo bar


a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718

From a NumPy structured array. A Pandas DataFrame operates much like


a structured array, and can be created directly from one:
In[28]: A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
Out[28]: array([(0, 0.0), (0, 0.0), (0, 0.0)],
dtype=[('A', '<i8'), ('B', '<f8')])

In[29]: pd.DataFrame(A)
Out[29]: A B
0 0 0.0
1 0 0.0
2 0 0.0
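The conversion also works in the other direction. As a sketch, the to_records method recovers a NumPy record array from a DataFrame (passing index=False drops the index column):

```python
import numpy as np
import pandas as pd

A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
df = pd.DataFrame(A)

# Round-trip back to a NumPy structured/record array.
rec = df.to_records(index=False)
print(rec.dtype.names)  # field names come from the DataFrame columns
```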

3. The Pandas Index Object


Both the Series and DataFrame objects contain an explicit index that lets you reference
and modify data. This Index object is an interesting structure in itself, and it can be
thought of either as an immutable array or as an ordered set (technically a multiset, as
Index objects may contain repeated values). Those views have some interesting
consequences in the operations available on Index objects. As a simple example, let’s
construct an Index from a list of integers:
In[30]: ind = pd.Index([2, 3, 5, 7, 11])
ind
Out[30]: Int64Index([2, 3, 5, 7, 11], dtype='int64')

Index as immutable array


The Index object in many ways operates like an array. For example, we can use
standard Python indexing notation to retrieve values or slices:
In[31]: ind[1]
Out[31]: 3
In[32]: ind[::2]
Out[32]: Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:
In[33]: print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64

One difference between Index objects and NumPy arrays is that indices are
immutable—that is, they cannot be modified via the normal means:
In[34]: ind[1] = 0
---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0

/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py ...
1243
1244 def __setitem__(self, key, value):
-> 1245 raise TypeError("Index does not support mutable operations")
1246
1247 def __getitem__(self, key):

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple DataFrames and
arrays, without the potential for side effects from inadvertent index modification.
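As a minimal sketch of this safe sharing, two Series constructed from the same Index object reference the very same immutable labels, so neither can accidentally change the other's index:

```python
import pandas as pd

ind = pd.Index(['a', 'b', 'c'])
s1 = pd.Series([1, 2, 3], index=ind)
s2 = pd.Series([4, 5, 6], index=ind)

# Because Index objects are immutable, pandas can reuse the object
# rather than making a defensive copy.
print(s1.index is s2.index)
```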

Index as ordered set


The Index object follows many of the conventions used by Python’s built-in set
data structure, so that unions, intersections, differences, and other combinations can be
computed in a familiar way:
In[35]: indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In[36]: indA & indB # intersection

Out[36]: Int64Index([3, 5, 7], dtype='int64')

In[37]: indA | indB # union

Out[37]: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In[38]: indA ^ indB # symmetric difference

Out[38]: Int64Index([1, 2, 9, 11], dtype='int64')

These operations may also be accessed via object methods, for example
indA.intersection(indB).
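A note for newer pandas versions: the &, |, and ^ operators on Index objects now perform elementwise logical/bitwise operations rather than set arithmetic, so the method forms are the safer spelling. A sketch:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Method forms work the same across pandas versions.
print(indA.intersection(indB))          # the common elements 3, 5, 7
print(indA.union(indB))                 # all elements, sorted
print(indA.symmetric_difference(indB))  # elements in exactly one index
```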

Operating on Data in Pandas


One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and
with more sophisticated operations (trigonometric functions, exponential and
logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy
through its universal functions (ufuncs).
We will see that there are well-defined operations between one-dimensional Series
structures and two-dimensional DataFrame structures.

Ufuncs: Index Preservation


Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects. Let’s start by defining a simple Series
and DataFrame on which to demonstrate this:
In[1]: import pandas as pd
import numpy as np
In[2]: rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
Out[2]: 0 6
1 3
2 7
3 4
dtype: int64

In[3]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)),


columns=['A', 'B', 'C', 'D'])
df
Out[3]: A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4

If we apply a NumPy ufunc on either of these objects, the result will be another
Pandas object with the indices preserved:
In[4]: np.exp(ser)

Out[4]: 0 403.428793
1 20.085537
2 1096.633158
3 54.598150
dtype: float64

Or, for a slightly more complex calculation:


In[5]: np.sin(df * np.pi / 4)
Out[5]: A B C D
0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16

UFuncs: Index Alignment


For binary operations on two Series or DataFrame objects, Pandas will align
indices in the process of performing the operation. This is very convenient when you
are working with incomplete data, as we'll see in the examples that follow.

Index alignment in Series


As an example, suppose we are combining two different data sources, and find only
the top three US states by area and the top three US states by population:
In[6]: area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')

Let’s see what happens when we divide these to compute the population density:
In[7]: population / area
Out[7]: Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we
could determine using standard Python set arithmetic on these indices:
In[8]: area.index | population.index
Out[8]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked with NaN, or "Not
a Number," which is how Pandas marks missing data. This index matching is implemented
this way for any of Python's built-in arithmetic expressions; any missing values are
filled in with NaN by default:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
Out[9]: 0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64

If using NaN values is not the desired behavior, we can modify the fill value using
appropriate object methods in place of the operators. For example, calling A.add(B)
is equivalent to calling A + B, but allows optional explicit specification of the fill
value for any elements in A or B that might be missing:
In[10]: A.add(B, fill_value=0)
Out[10]: 0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
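The same fill_value keyword is available on the other arithmetic methods as well. As a sketch, here is mul with missing entries treated as 1 (a natural identity for multiplication):

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Indices 0 and 3 are missing from one side; fill_value=1 supplies
# the multiplicative identity before the elementwise product.
result = A.mul(B, fill_value=1)
print(result)
```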

Index alignment in DataFrame


A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames:
In[11]: A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A
Out[11]: A B
0 1 11
1 5 1
In[12]: B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B
Out[12]: B A C
0 4 0 9
1 5 8 0
2 9 2 6

In[13]: A + B
Out[13]: A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN

Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the asso‐
ciated object’s arithmetic method and pass any desired fill_value to be used in
place of missing entries. Here we’ll fill with the mean of all values in A (which we
compute by first stacking the rows of A):
In[14]: fill = A.stack().mean()
A.add(B, fill_value=fill)
Out[14]: A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5

Table 3-1 lists Python operators and their equivalent Pandas object methods.

Table 3-1. Mapping between Python operators and Pandas methods


Python operator Pandas method(s)

+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
** pow()
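As a quick sketch, the equivalence of each operator/method pair in the table can be checked directly (the data here is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df1 = pd.DataFrame(rng.randint(1, 10, (2, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.randint(1, 10, (2, 2)), columns=list('AB'))

# Each operator produces exactly the same result as its method form.
assert (df1 + df2).equals(df1.add(df2))
assert (df1 / df2).equals(df1.truediv(df2))
assert (df1 ** df2).equals(df1.pow(df2))
print("operator/method pairs agree")
```

The advantage of the method forms, as shown above, is the extra keyword arguments (fill_value, axis) that the bare operators cannot accept.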

Ufuncs: Operations Between DataFrame and Series


When you are performing operations between a DataFrame and a Series, the index
and column alignment is similarly maintained. Operations between a DataFrame and
a Series are similar to operations between a two-dimensional and one-dimensional
NumPy array. Consider one common operation, where we find the difference of a two-
dimensional array and one of its rows:
In[15]: A = rng.randint(10, size=(3, 4))
A
Out[15]: array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
In[16]: A - A[0]
Out[16]: array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])

According to NumPy’s broadcasting rules, subtraction between a two-dimensional


array and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:
In[17]: df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
Out[17]: Q R S T
0 0 0 0 0
1 -1 -2 2 4
2 3 -7 1 4

If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18]: Q R S T
0 -5 0 -6 -4
1 -4 0 -2 2
2 5 0 2 7

Note that these DataFrame/Series operations, like the operations discussed


before, will automatically align indices between the two elements:
In[19]: halfrow = df.iloc[0, ::2]
halfrow
Out[19]: Q 3
S 2
Name: 0, dtype: int64
In[20]: df - halfrow
Out[20]: Q R S T
0 0.0 NaN 0.0 NaN
1 -1.0 NaN 2.0 NaN
2 3.0 NaN 1.0 NaN

This preservation and alignment of indices and columns means that operations on data
in Pandas will always maintain the data context, which prevents the types of silly errors
that might come up when you are working with heterogeneous and/or misaligned
data in raw NumPy arrays.
Data Indexing and Selection
Here we look at the means of accessing and modifying values in Pandas Series and DataFrame objects.

Data Selection in Series


A Series object acts in many ways like a one-dimensional NumPy array, and in
many ways like a standard Python dictionary.
Keeping these two analogies in mind helps us understand the patterns of data indexing and selection in these structures.

Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to
a collection of values:
In[1]: import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Out[1]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In[2]: data['b']
Out[2]: 0.5
In[3]: 'a' in data
Out[3]: True
In[4]: data.keys()
Out[4]: Index(['a', 'b', 'c', 'd'], dtype='object')
In[5]: list(data.items())
Out[5]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can also be modified with a dictionary-like syntax; we can extend a Series by assigning to a new index value:
In[6]: data['e'] = 1.25
data
Out[6]:
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
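The complementary operation is also available. As a sketch, the drop method returns a new Series with the given labels removed, leaving the original unchanged:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# drop does not modify data in place; it returns a new Series.
trimmed = data.drop('e')
print(trimmed.index.tolist())
```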

Series as one-dimensional array


A Series also supports array-style item selection: slicing, masking, and fancy indexing.
In[7]: # slicing by explicit index
data['a':'c']
Out[7]:
a 0.25
b 0.50
c 0.75
dtype: float64
In[8]: # slicing by implicit integer index
data[0:2]
Out[8]:
a 0.25
b 0.50
dtype: float64
In[9]: # masking
data[(data > 0.3) & (data < 0.8)]
Out[9]:
b 0.50
c 0.75
dtype: float64
In[10]: # fancy indexing
data[['a', 'e']]
Out[10]:
a 0.25
e 1.25
dtype: float64

Indexers: loc, iloc, and ix


These slicing and indexing conventions can be a source of confusion. For example, if
your Series has an explicit integer index, an indexing operation such as data[1]
will use the explicit indices, while a slicing operation like data[1:3] will use the
implicit Python-style index.
In[11]: data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Out[11]: 1 a
3 b
5 c
dtype: object

In[12]: # explicit index when indexing


data[1]
Out[12]: 'a'
In[13]: # implicit index when slicing
data[1:3]
Out[13]: 3 b
5 c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some
special indexer attributes that explicitly expose certain indexing schemes. These
are not functional methods, but attributes that expose a particular slicing interface to
the data in the Series.
First, the loc attribute allows indexing and slicing that always references the
explicit index:
In[14]: data.loc[1]
Out[14]: 'a'
In[15]: data.loc[1:3]
Out[15]: 1 a
3 b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:
In[16]: data.iloc[1]
Out[16]: 'b'
In[17]: data.iloc[1:3]
Out[17]: 3 b
5 c
dtype: object

A third indexing attribute, ix, is a hybrid of the two, and for Series objects is
equivalent to standard []-based indexing. The purpose of the ix indexer will become
more apparent in the context of DataFrame objects, which we will discuss in a moment.
(Note that ix has since been deprecated and was removed in pandas 1.0; in modern code,
use loc and iloc instead.)
One guiding principle of Python code is that "explicit is better than implicit." The
explicit nature of loc and iloc makes them very useful for maintaining clean and
readable code; especially in the case of integer indexes, I recommend using these both
to make code easier to read and understand, and to prevent subtle bugs due to the mixed
indexing/slicing convention.

Data Selection in DataFrame


A DataFrame acts in many ways like a two-dimensional or structured array, and in
other ways like a dictionary of Series structures sharing the same index.

DataFrame as a dictionary
Let’s return to our example of areas and populations of states:
In[18]: area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
Out[18]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193

# dictionary-style indexing of the column name:


In[19]: data['area']
Out[19]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

# attribute-style access with column names that are strings:


In[20]: data.area
Out[20]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

attribute-style accesses the exact same object as the dictionary-style access:


In[21]: data.area is data['area']
Out[21]: True

Keep in mind that this shorthand does not work for all cases! For example, if the
column names are not strings, or if the column names conflict with methods of the
DataFrame, attribute-style access is not possible. For example, the DataFrame has a
pop() method, so data.pop will point to this rather than the "pop" column:
In[22]: data.pop is data['pop']
Out[22]: False
In particular, you should avoid the temptation to try column assignment via attribute
(i.e., use data['pop'] = z rather than data.pop = z).
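A minimal sketch of this name collision (the values here are arbitrary):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967], 'pop': [38332521]},
                    index=['California'])

# data.pop resolves to the DataFrame.pop method, not the 'pop' column,
# so attribute access silently gives you something other than the data.
print(callable(data.pop))   # True: it's the bound method
print(data['pop'].iloc[0])  # bracket syntax reliably reaches the column
```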
Like with the Series objects discussed earlier, this dictionary-style syntax can also
be used to modify the object, in this case to add a new column:
In[23]: data['density'] = data['pop'] / data['area']
data
Out[23]: area pop density
California 423967 38332521 90.413926
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740

DataFrame as two-dimensional array


The DataFrame can also be viewed as an enhanced two-dimensional array. We can examine
the raw underlying data array using the values attribute:
In[24]: data.values
Out[24]: array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
[ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
[ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
[ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
[ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])

we can transpose the full DataFrame to swap rows and columns:


In[25]: data.T
Out[25]:
California Florida Illinois New York Texas
area 4.239670e+05 1.703120e+05 1.499950e+05 1.412970e+05 6.956620e+05
pop 3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
density 9.041393e+01 1.148061e+02 8.588376e+01 1.390767e+02 3.801874e+01

Passing a single index to an array accesses a row:


In[26]: data.values[0]
Out[26]: array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
and passing a single “index” to a DataFrame accesses a column:
In[27]: data['area']
Out[27]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using
the iloc indexer, we can index the underlying array as if it is a simple NumPy array
(using the implicit Python-style index), but the DataFrame index and column labels
are maintained in the result:
In[28]: data.iloc[:3, :2]
Out[28]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135

In[29]: data.loc[:'Illinois', :'pop']


Out[29]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135

The ix indexer allows a hybrid of these two approaches:


In[30]: data.ix[:3, :'pop']
Out[30]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135

Keep in mind that for integer indices, the ix indexer is subject to the same potential
sources of confusion as discussed for integer-indexed Series objects.
Any of the familiar NumPy-style data access patterns can be used within these
indexers. For example, in the loc indexer we can combine masking and fancy
indexing as in the following:
In[31]: data.loc[data.density > 100, ['pop', 'density']]
Out[31]: pop density
Florida 19552860 114.806121
New York 19651127 139.076746

Any of these indexing conventions may also be used to set or modify values; this is
done in the standard way that you might be accustomed to from working with NumPy:
In[32]: data.iloc[0, 2] = 90
data
Out[32]: area pop density
California 423967 38332521 90.000000
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740

To build up your fluency in Pandas data manipulation, I suggest spending some time
with a simple DataFrame and exploring the types of indexing, slicing, masking, and
fancy indexing that are allowed by these various indexing approaches.

Additional indexing conventions


There are a couple of extra indexing conventions: while indexing refers to columns, slicing refers to rows:
In[33]: data['Florida':'Illinois']
Out[33]: area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763

Such slices can also refer to rows by number rather than by index:
In[34]: data[1:3]
Out[34]: area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763

Similarly, direct masking operations are also interpreted row-wise rather than
column-wise:
In[35]: data[data.density > 100]
Out[35]: area pop density
Florida 170312 19552860 114.806121
New York 141297 19651127 139.076746
One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and
with more sophisticated operations (trigonometric functions, exponential and loga‐
rithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and
the ufuncs.

Ufuncs: Index Preservation


ufunc will work on Pandas Series and DataFrame objects. Let’s start by defining
a simple Series and DataFrame on which to demonstrate this:
In[1]: import pandas as pd
import numpy as np
In[2]: rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
Out[2]: 0 6
4 3
5 7
6 4
dtype: int64

In[3]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)),


columns=['A', 'B', 'C', 'D'])
df
Out[3]: A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4

If we apply a NumPy ufunc on either of these objects, the result will be another Pan‐
das object with the indices preserved:
In[4]: np.exp(ser)
Out[4]: 0 403.428793
4 20.085537
5 1096.633158
6 54.598150
dtype: float64

Or, for a slightly more complex calculation:


In[5]: np.sin(df * np.pi / 4)
Out[5]: A B C D
0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16

Any of the ufuncs discussed in “Computation on NumPy Arrays: Universal Func‐


tions” on page 50 can be used in a similar manner.

UFuncs: Index Alignment


For binary operations on two Series or DataFrame objects, Pandas will align
indices in the process of performing the operation. This is very convenient when you
are working with incomplete data, as we’ll see in some of the examples that follow.

Index alignment in Series


As an example, suppose we are combining two different data sources, and find only
the top three US states by area and the top three US states by population:
In[6]: area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')

Let’s see what happens when we divide these to compute the population density:
In[7]: population / area
Out[7]: Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
dtype: float64

The resulting array contains the union of indices of the two input arrays, which we
could determine using standard Python set arithmetic on these indices:
In[8]: area.index | population.index
Out[8]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked with NaN, or “Not a
Number,” which is how Pandas marks missing data (see further discussion of missing data
in “Handling Missing Data” on page 119). This index matching is imple‐
mented this way for any of Python’s built-in arithmetic expressions; any missing val‐
ues are filled in with NaN by default:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
Out[9]: 0 NaN
4 5.0
5 9.0
6 NaN
dtype: float64

If using NaN values is not the desired behavior, we can modify the fill value using
appropriate object methods in place of the operators. For example, calling A.add(B)
is equivalent to calling A + B, but allows optional explicit specification of the fill
value for any elements in A or B that might be missing:
In[10]: A.add(B, fill_value=0)
Out[10]: 0 2.0
4 5.0
5 9.0
6 5.0
dtype: float64

Index alignment in DataFrame


A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames:
In[11]: A = pd.DataFrame(rng.randint(0, 20, (2,
2)), columns=list('AB'))
A
Out[11]: A B
0 1 11
1 5 1
In[12]: B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B
Out[12]: B A C
0 4 0 9
1 5 8 0
2 9 2 6

In[13]: A + B
Out[13]: A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we'll fill with the mean of all values in A (which we compute by first stacking the rows of A):
In[14]: fill = A.stack().mean()
A.add(B, fill_value=fill)
Out[14]: A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5

Table 3-1 lists Python operators and their equivalent Pandas object methods.

Table 3-1. Mapping between Python operators and Pandas methods


Python operator Pandas method(s)

+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
** pow()
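The mapping in Table 3-1 can be checked directly: with no extra arguments, each method matches its operator exactly; the methods only differ in additionally accepting keywords such as fill_value. A small sketch:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# With no extra arguments, the method form matches the operator exactly
assert (A + B).equals(A.add(B))
assert (A * B).equals(A.mul(B))
assert (A / B).equals(A.div(B))

# The methods additionally accept a fill_value for missing entries
print(A.mul(B, fill_value=1))
```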

Ufuncs: Operations Between DataFrame and Series


When you are performing operations between a DataFrame and a Series, the index
and column alignment is similarly maintained. Operations between a DataFrame and
a Series are similar to operations between a two-dimensional and a one-dimensional
NumPy array. Consider one common operation, where we find the difference of a
two-dimensional array and one of its rows:
In[15]: A = rng.randint(10, size=(3, 4))
A
Out[15]: array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
In[16]: A - A[0]
Out[16]: array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
According to NumPy's broadcasting rules (see "Computation on Arrays: Broadcasting" on page 63), subtraction between a two-dimensional array and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:
In[17]: df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
Out[17]: Q R S T
0 0 0 0 0
1 -1 -2 2 4
2 3 -7 1 4

If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18]: Q R S T
0 -5 0 -6 -4
1 -4 0 -2 2
2 5 0 2 7

Note that these DataFrame/Series operations, like the operations discussed before, will automatically align indices between the two elements:
In[19]: halfrow = df.iloc[0, ::2]
halfrow
Out[19]: Q 3
S 2
Name: 0, dtype: int64
In[20]: df - halfrow
Out[20]: Q R S T
0 0.0 NaN 0.0 NaN
1 -1.0 NaN 2.0 NaN
2 3.0 NaN 1.0 NaN

This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when you are working with heterogeneous and/or misaligned data in raw NumPy arrays.

Handling Missing Data


Real-world data is rarely clean and homogeneous; many interesting datasets will have some amount of data missing. Moreover, different data sources may indicate missing data in different ways. Missing data is generally represented as null, NaN, or NA values in Pandas.

Trade-Offs in Missing Data Conventions


There are two general strategies:

1. Use a mask that globally indicates missing values. The mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value. Use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation.

2. Choose a sentinel value that indicates a missing entry. The sentinel could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number). A sentinel reduces the range of valid values that can be represented and may require extra (often non-optimized) logic in CPU and GPU arithmetic, and common special values like NaN are not available for all data types.
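The two strategies can be illustrated side by side in plain NumPy: numpy.ma provides the mask approach, while NaN (for floats) or a data-specific value such as -9999 (for integers, purely as the illustrative convention mentioned above) acts as a sentinel:

```python
import numpy as np

# Mask strategy: a separate Boolean array marks the missing entries
masked = np.ma.masked_array([1, 2, 3, 4], mask=[False, True, False, False])
print(masked.sum())        # ignores the masked entry

# Sentinel strategy (floats): NaN lives inside the data itself
vals = np.array([1.0, np.nan, 3.0, 4.0])
print(np.nansum(vals))     # NaN-aware aggregation

# Sentinel strategy (ints): a data-specific convention such as -9999
ints = np.array([1, -9999, 3, 4])
print(ints[ints != -9999].sum())  # manual filtering is required
```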

Missing Data in Pandas


Pandas uses sentinels for missing data, and chooses two already-existing Python null values: the special floating-point NaN value, and the Python None object.

None: Pythonic missing data


The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in arbitrary NumPy/Pandas arrays, but only in arrays with data type 'object' (i.e., arrays of Python objects):

In[1]: import numpy as np
import pandas as pd
In[2]: vals1 = np.array([1, None, 3, 4])
vals1
Out[2]: array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could
infer for the contents of the array is that they are Python objects.

NaN: Missing numerical data


The other missing data representation, NaN (acronym for Not a Number), is different;
it is a special floating-point value recognized by all systems that use the standard IEEE
floating-point representation:
In[5]: vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Out[5]: dtype('float64')

Regardless of the operation, the result of arithmetic with NaN will be another NaN:
In[6]: 1 + np.nan
Out[6]: nan
In[7]: 0 * np.nan
Out[7]: nan

Aggregates over the values are well defined (i.e., they don’t result in an error) but not
always useful:
In[8]: vals2.sum(), vals2.min(), vals2.max()
Out[8]: (nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values.

In[9]: np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)


Out[9]: (8.0, 1.0, 4.0)
Keep in mind that NaN is specifically a floating-point value; there is no equivalent
NaN value for integers, strings, or other types.

NaN and None in Pandas


NaN and None both have their place, and Pandas is built to handle the two of them
nearly interchangeably, converting between them where appropriate:
In[10]: pd.Series([1, np.nan, 2, None])
Out[10]: 0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64

Pandas automatically type-casts when NA values are present. For example, if we set a
value in an integer array to np.nan, it will automatically be upcast to a floating-point
type to accommodate the NA:
In[11]: x = pd.Series(range(2), dtype=int)
x
Out[11]: 0 0
1 1
dtype: int64

In[12]: x[0] = None
x
Out[12]: 0 NaN
1 1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas
automatically converts the None to a NaN value.

Table 3-2. Pandas handling of NAs by type


Typeclass Conversion when storing NAs NA sentinel value
floating No change np.nan
object No change None or np.nan
integer Cast to float64 np.nan
boolean Cast to object None or np.nan

Keep in mind that in Pandas, string data is always stored with an object dtype.

Operating on Null Values


Pandas treats None and NaN as essentially interchangeable for indicating missing or
null values. To facilitate this convention, there are several useful methods for
detecting, removing, and replacing null values in Pandas data structures. They are:
isnull()
Generate a Boolean mask indicating missing values
notnull()
Opposite of isnull()
dropna()
Return a filtered version of the data
fillna()
Return a copy of the data with missing values filled or imputed

Detecting null values


Pandas data structures have two useful methods for detecting null data: isnull()
and notnull(). Either one will return a Boolean mask over the data. For example:
In[13]: data = pd.Series([1, np.nan, 'hello', None])
In[14]: data.isnull()
Out[14]: 0 False
1 True
2 False
3 True
dtype: bool

In[15]: data[data.notnull()]
Out[15]: 0 1
2 hello
dtype: object

The isnull() and notnull() methods produce similar Boolean results for DataFrames.

Dropping null values

There are two convenience methods for removing or replacing null values: dropna() (which removes NA values) and fillna() (which fills in NA values).
For a Series, the result is straightforward:
In[16]: data.dropna()
Out[16]: 0 1
2 hello
dtype: object

For a DataFrame, there are more options. Consider the following DataFrame:
In[17]: df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
Out[17]: 0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6

We cannot drop single values from a DataFrame; we can only drop full rows or full
columns.
dropna() will drop all rows in which any null value is present:
In[18]: df.dropna()
Out[18]: 0 1 2
1 2.0 3.0 5

drop NA values along a different axis; axis=1 drops all columns containing a null
value:
In[19]: df.dropna(axis='columns')
Out[19]: 2
0 2
1 5
2 6
But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow. The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:
In[20]: df[3] = np.nan
df
Out[20]: 0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In[21]: df.dropna(axis='columns', how='all')
Out[21]: 0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6

The thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
In[22]: df.dropna(axis='rows', thresh=3)
Out[22]: 0 1 2 3
1 2.0 3.0 5 NaN

Here the first and last row have been dropped, because they contain only two non-null
values.

Filling null values


Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.
Consider the following Series:
In[23]: data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
Out[23]: a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64

We can fill NA entries with a single value, such as zero:


In[24]: data.fillna(0)
Out[24]: a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64

We can specify a forward-fill to propagate the previous value forward:


In[25]: # forward-fill
data.fillna(method='ffill')
Out[25]: a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:


In[26]: # back-fill
data.fillna(method='bfill')
Out[26]: a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64

For DataFrames, the options are similar, but we can also specify an axis along
which the fills take place:
In[27]: df
Out[27]: 0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In[28]: df.fillna(method='ffill', axis=1)
Out[28]: 0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0

Notice that if a previous value is not available during a forward fill, the NA
value remains.
Hierarchical Indexing
Here we create a pandas.Series object with a MultiIndex. Let's break it down:
1. Data: Randomly generated numbers using np.random.uniform(size=9),
resulting in 9 floating-point values.
2. Index: A MultiIndex with two levels:
o The first level has the values: ["a", "a", "a", "b", "b", "c", "c", "d",
"d"].
o The second level has the values: [1, 2, 3, 1, 3, 1, 2, 2, 3].

data = pd.Series(np.random.uniform(size=9),
                 index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
a 1 0.412178
2 0.158417
3 0.111628
b 1 0.394864
3 0.001674
c 1 0.498256
2 0.936531
d 2 0.096640
3 0.746740
dtype: float64

data.index
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
With a hierarchically indexed object, so-called partial indexing can be used to concisely select subsets of the data:

data["b"]
1 0.394864
3 0.001674
dtype: float64
data["b":"c"]
b 1 0.394864
3 0.001674
c 1 0.498256
2 0.936531
dtype: float64

data.loc[["b", "d"]]
b 1 0.394864
3 0.001674
d 2 0.096640
3 0.746740
dtype: float64

Selection is even possible from an "inner" level. Here we select all of the values having the value 2 in the second index level:

data.loc[:, 2]
a 0.158417
c 0.936531
d 0.096640
dtype: float64

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table.

Rearrange this data into a DataFrame using its unstack method:


data.unstack()

The inverse operation of unstack is stack:


data.unstack().stack()
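The unstack/stack round trip can be sketched with small fixed values in place of the random data above (a minimal, reproducible example):

```python
import pandas as pd

# Small fixed values instead of random data, for reproducibility
data = pd.Series([0.1, 0.2, 0.3, 0.4],
                 index=[["a", "a", "b", "b"], [1, 2, 1, 2]])

df = data.unstack()    # the inner index level becomes the columns
print(df)

restored = df.stack()  # the columns move back into the index
print(restored)
```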
With a DataFrame, either axis can have a hierarchical index:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame

The hierarchical levels can have names, given as strings or any Python objects:

frame.index.names = ["key1", "key2"]
frame

frame.columns.names = ["state", "color"]
frame
You can see how many levels an index has by accessing its nlevels attribute:

frame.index.nlevels
2
With partial column indexing you can similarly select groups of columns:

frame["Ohio"]

A MultiIndex can be created by itself and then reused:

pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                           ["Green", "Red", "Green"]],
                          names=["state", "color"])
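Such a pre-built MultiIndex can then be passed directly as the columns (or index) argument of a DataFrame constructor. A sketch:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                                  ["Green", "Red", "Green"]],
                                 names=["state", "color"])

frame = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=cols)
print(frame["Ohio"])          # outer-level selection yields Green and Red
print(frame.columns.nlevels)  # 2
```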

Reordering and Sorting Levels


The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered):

frame.swaplevel("key1", "key2")
sort_index by default sorts the data lexicographically using all the index levels, but you can choose to sort by only a single level or a subset of levels by passing the level argument:
frame.sort_index(level=1)

frame.swaplevel(0, 1).sort_index(level=0)

Summary Statistics by Level


Many descriptive and summary statistics on DataFrame and Series have a level
option in which we can specify the level to aggregate by on a particular axis.
frame.groupby(level="key2").sum()

frame.groupby(level="color", axis="columns").sum()
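For instance, summing over key1 while keeping key2 can be sketched as follows, reconstructing the frame defined earlier in this section:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Rows sharing the same key2 value are summed together
by_key2 = frame.groupby(level="key2").sum()
print(by_key2)
```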
Indexing with a DataFrame’s columns
frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
"c": ["one", "one", "one", "two", "two",
"two", "two"],
"d": [0, 1, 2, 0, 1, 2, 3]})

DataFrame’s set_index function will create a new DataFrame using one or more of
its columns as the index:

frame2 = frame.set_index(["c", "d"])


frame2
By default, the columns are removed from the DataFrame, though you can leave
them in by passing drop=False to set_index:

frame.set_index(["c", "d"], drop=False)

reset_index, on the other hand, does the opposite of set_index; the hierarchical index
levels are moved into the columns:
frame2.reset_index()
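The set_index/reset_index pair forms a round trip; a sketch using the frame defined above (note that reset_index places the former index levels first, so column order differs from the original):

```python
import pandas as pd

frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
                      "c": ["one", "one", "one", "two", "two", "two", "two"],
                      "d": [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame.set_index(["c", "d"])  # c and d become a two-level index
print(frame2.index.nlevels)           # 2

restored = frame2.reset_index()       # index levels move back into columns
print(list(restored.columns))
```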
