Data Manipulation With Pandas
Data Manipulation With Pandas
the Series wraps both a sequence of values and a sequence of indices, which we can
access with the values and index attributes. The values are simply a familiar
NumPy array:
In[3]: data.values
Out[3]: array([ 0.25, 0.5 , 0.75, 1. ])
Data can be accessed by the associated index via the familiar Python square-bracket notation:
In[5]: data[1]
Out[5]: 0.5
In[6]: data[1:3]
Out[6]: 1 0.50
2 0.75
dtype: float64
The Pandas Series is much more general and flexible than the one-dimensional
NumPy array that it emulates.
Series as generalized NumPy array
Series object is basically inter‐changeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of any desired
type. For example, if we wish, we can use strings as an index:
In[7]: data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
Out[7]: a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
By default, a Series will be created where the index is drawn from the sorted keys.
From here, typical dictionary-style item access can be performed:
In[12]: population['California']
Out[12]: 38332521
where index is an optional argument, and data can be one of many entities.
For example, data can be a list or NumPy array, in which case index defaults to
an integer sequence:
In[14]: pd.Series([2, 4, 6])
Out[14]: 0 2
1 4
2 6
dtype: int64
data can be a dictionary, in which index defaults to the sorted dictionary keys:
In[16]: pd.Series({2:'a', 1:'b', 3:'c'})
Out[16]: 1 b
2 a
3 c
4 dtype:
object
the index can be explicitly set if a different result is preferred:
In[17]: pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
Out[17]: 3 c
2 a dtype:
object
Notice that in this case, the Series is populated only with the explicitly identified
keys.
area = pd.Series(area_dict)
area
Out[18]: California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
Now that we have this along with the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:
In[19]: states = pd.DataFrame({'population': population,
'area': area})
states
Out[19]: area population
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
Like the Series object, the DataFrame has an index attribute that gives access to
the index labels:
In[20]: states.index
Out[20]:
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
From a list of dicts. Any list of dictionaries can be made into a DataFrame.
Use a simple list comprehension to create some data:
In[24]: data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
Out[24]: a b
0 0 0
1 1 2
2 2 4
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN
(i.e., “not a number”) values:
In[25]: pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
Out[25]: a b c
0 1.0 2 NaN
1 NaN 3 4.0
2), columns=['foo',
'bar'], index=['a',
'b', 'c'])
In[29]: pd.DataFrame(A)
Out[29]: A B
0 0 0.0
1 0 0.0
2 0 0.0
Index objects also have many of the attributes familiar from NumPy arrays:
In[33]: print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64
One difference between Index objects and NumPy arrays is that indices are
immutable—that is, they cannot be modified via the normal means:
In[34]: ind[1] = 0
---------------------------------------------------------------------------
<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0
/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py ...
1243
1244 def __setitem__(self, key, value):
-> 1245 raise TypeError("Index does not support mutable
operations") 1246
1247 def __getitem__(self, key):
This immutability makes it safer to share indices between multiple DataFrames and
arrays, without the potential for side effects from inadvertent index modification.
If we apply a NumPy ufunc on either of these objects, the result will be another Pan‐
das object with the indices preserved:
In[4]: np.exp(ser)
Out[4]: 0 403.428793
1 20.085537
2 1096.633158
3 54.598150
dtype: float64
Let’s see what happens when we divide these to compute the population density:
In[7]: population / area
Out[7]: Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we
could determine using standard Python set arithmetic on these indices:
In[8]: area.index | population.index
Out[8]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with NaN, or “Not
a Number. This index matching is implemented this way for any of Python’s built-in
arithmetic expressions; any missing val‐ ues are filled in with NaN by default:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
Out[9]: 0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
If using NaN values is not the desired behavior, we can modify the fill value using
appropriate object methods in place of the operators. For example, calling A.add(B)
is equivalent to calling A + B, but allows optional explicit specification of the fill
value for any elements in A or B that might be missing:
In[10]: A.add(B, fill_value=0)
Out[10]: 0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
A
B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the asso‐
ciated object’s arithmetic method and pass any desired fill_value to be used in
place of missing entries. Here we’ll fill with the mean of all values in A (which we
compute by first stacking the rows of A):
In[14]: fill = A.stack().mean()
A.add(B, fill_value=fill)
Out[14]: A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the asso‐
ciated object’s arithmetic method and pass any desired fill_value to be used in
place of missing entries. Here we’ll fill with the mean of all values in A (which we
compute by first stacking the rows of A):
In[14]: fill = A.stack().mean()
A.add(B, fill_value=fill)
Out[14]: A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Table 3-1 lists Python operators and their equivalent Pandas object methods.
+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
**pow()
If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18]: Q R S T
0 -5 0 -6 -4
1 -4 0 -2 2
2 5 0 2 7
This preservation and alignment of indices and columns means that operations on data
in Pandas will always maintain the data context, which prevents the types of silly errors
that might come up when you are working with heterogeneous and/or mis‐ aligned
data in raw NumPy arrays.
Data Indexing and Selection
Learning for accessing and modifying values in Pandas Series and DataFrame objects.
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to
a collection of values:
In[1]: import pandas as pd
data = pd.Series([0.25, 0.5,
0.75, 1.0],
index=['a', 'b', 'c',
'd'])
data
Out[1]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In[2]: data['b']
Out[2]: 0.5
In[3]: 'a' in data
Out[3]: True
In[4]: data.keys()
Out[4]: Index(['a', 'b', 'c', 'd'], dtype='object')
In[5]: list(data.items())
Out[5]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d',
1.0)]
A Series can be assigned to a new index value:
In[6]: data['e'] = 1.25
data
Out[6]:
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
Because of this potential confusion in the case of integer indexes, Pandas provides some
special indexer attributes that explicitly expose certain indexing schemes. These
are not functional methods, but attributes that expose a particular slicing interface to
the data in the Series.
First, the loc attribute allows indexing and slicing that always references the
explicit index:
In[14]: data.loc[1]
Out[14]: 'a'
In[15]: data.loc[1:3]
Out[15]: 1 a
3 b dtype:
object
The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:
In[16]: data.iloc[1]
Out[16]: 'b'
In[17]: data.iloc[1:3]
Out[17]: 3 b
5 c dtype:
object
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equiva‐
lent to standard []-based indexing. The purpose of the ix indexer will become more
apparent in the context of DataFrame objects, which we will discuss in a moment.
One guiding principle of Python code is that “explicit is better than implicit.” The
explicit nature of loc and iloc make them very useful in maintaining clean and
readable code; especially in the case of integer indexes, I recommend using these both
to make code easier to read and understand, and to prevent subtle bugs due to the mixed
indexing/slicing convention.
DataFrame as a dictionary
Let’s return to our example of areas and populations of states:
In[18]: area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
Out[18]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
New York 141297 19651127
Texas 695662 26448193
it does not work for all cases! For example, if the column names are not strings, or if
the column names conflict with methods of the DataFrame, this attribute-style access
is not possible. For example, the DataFrame has a pop() method, so data.pop
will point to this rather than the "pop" column:
In[22]: data.pop is data['pop']
Out[22]: False
In particular, you should avoid the temptation to try column assignment via attribute
(i.e., use data['pop'] = z rather than data.pop = z).
Like with the Series objects discussed earlier, this dictionary-style syntax can also
be used to modify the object, in this case to add a new column:
In[23]: data['density'] = data['pop'] /
data['area'] data
Out[23]: area pop density
California 423967 38332521 90.413926
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740
Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using
the iloc indexer, we can index the underlying array as if it is a simple NumPy array
(using the implicit Python-style index), but the DataFrame index and column labels
are maintained in the result:
In[28]: data.iloc[:3, :2]
Out[28]: area pop
California 423967 38332521
Florida 170312 19552860
Illinois 149995 12882135
Keep in mind that for integer indices, the ix indexer is subject to the same potential
sources of confusion as discussed for integer-indexed Series objects.
Any of the familiar NumPy-style data access patterns can be used within these index‐
ers. For example, in the loc indexer we can combine masking and fancy
indexing as in the following:
In[31]: data.loc[data.density > 100, ['pop', 'density']]
Out[31]: pop density
Florida 19552860 114.806121
New York 19651127 139.076746
Any of these indexing conventions may also be used to set or modify values; this is
done in the standard way that you might be accustomed to from working with NumPy:
In[32]: data.iloc[0, 2] = 90
data
Out[32]: area pop density
California 423967 38332521 90.000000
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
New York 141297 19651127 139.076746
Texas 695662 26448193 38.018740
To build up your fluency in Pandas data manipulation, I suggest spending some time
with a simple DataFrame and exploring the types of indexing, slicing, masking, and
fancy indexing that are allowed by these various indexing approaches.
Such slices can also refer to rows by number rather than by index:
In[34]: data[1:3]
Out[34]: area pop density
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
Similarly, direct masking operations are also interpreted row-wise rather than column-
wise:
In[35]: data[data.density > 100]
Out[35]: area pop density
Florida 170312 19552860 114.806121
New York 141297 19651127 139.076746
One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and
with more sophisticated operations (trigonometric functions, exponential and loga‐
rithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and
the ufuncs.
If we apply a NumPy ufunc on either of these objects, the result will be another Pan‐
das object with the indices preserved:
In[4]: np.exp(ser)
Out[4]: 0 403.428793
4 20.085537
5 1096.633158
6 54.598150
dtype: float64
Let’s see what happens when we divide these to compute the population density:
In[7]: population / area
Out[7]: Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we
could determine using standard Python set arithmetic on these indices:
In[8]: area.index | population.index
Out[8]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with NaN, or “Not a
Number,” which is how Pandas marks missing data (see further discussion of missing data
in “Handling Missing Data” on page 119). This index matching is imple‐
mented this way for any of Python’s built-in arithmetic expressions; any missing val‐
ues are filled in with NaN by default:
In[9]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
Out[9]: 0 NaN
4 5.0
5 9.0
6 NaN
dtype: float64
If using NaN values is not the desired behavior, we can modify the fill value using
appropriate object methods in place of the operators. For example, calling A.add(B)
is equivalent to calling A + B, but allows optional explicit specification of the fill
value for any elements in A or B that might be missing:
In[10]: A.add(B, fill_value=0)
Out[10]: 0 2.0
4 5.0
5 9.0
6 5.0
dtype: float64
In[13]: A +
B Out[13]: A B C
3 1.0 15.0 NaN
4 13.0 6.0 NaN
5 NaN NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects,
and indices in the result are sorted. As was the case with Series, we can use the asso‐
ciated object’s arithmetic method and pass any desired fill_value to be used in
place of missing entries. Here we’ll fill with the mean of all values in A (which we
compute by first stacking the rows of A):
In[14]: fill = A.stack().mean()
A.add(B, fill_value=fill)
Out[14]: A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Table 3-1 lists Python operators and their equivalent Pandas object methods.
+ add()
- sub(), subtract()
* mul(), multiply()
/ truediv(), div(), divide()
// floordiv()
% mod()
** pow()
If you would instead like to operate column-wise, you can use the object methods
mentioned earlier, while specifying the axis keyword:
In[18]: df.subtract(df['R'], axis=0)
Out[18]: Q R S T
0 -5 0 -6 -4
1 -4 0 -2 2
2 5 0 2 7
This preservation and alignment of indices and columns means that operations on data
in Pandas will always maintain the data context, which prevents the types of silly errors
that might come up when you are working with heterogeneous and/or mis‐ aligned
data in raw NumPy arrays.
A sentinel value reduces the range of valid values that can be represented, and
may require extra (often non-optimized) logic in CPU and GPU arithmetic.
Common special values like NaN are not available for all data types.
This dtype=object means that the best common type representation NumPy could
infer for the contents of the array is that they are Python objects.
NaN: Missing numerical data
The other missing data representation, NaN (acronym for Not a Number), is different;
it is a special floating-point value recognized by all systems that use the standard IEEE
floating-point representation:
In[5]: vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Out[5]: dtype('float64')
Regardless of the operation, the result of arithmetic with NaN will be another NaN:
In[6]: 1 + np.nan
Out[6]: nan
In[7]: 0 * np.nan
Out[7]: nan
Aggregates over the values are well defined (i.e., they don’t result in an error) but not
always useful:
In[8]: vals2.sum(), vals2.min(), vals2.max()
Out[8]: (nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values.
Pandas automatically type-casts when NA values are present. For example, if we set a
value in an integer array to np.nan, it will automatically be upcast to a floating-point
type to accommodate the NA:
In[11]: x = pd.Series(range(2), dtype=int)
x
Out[11]: 0 0
1 1
dtype: int64
Keep in mind that in Pandas, string data is always stored with an object dtype.
In[15]: data[data.notnull()]
Out[15]: 0 1
2 hello
dtype: object
The isnull() and notnull() methods produce similar Boolean results for
Data Frames.
Dropping null values
dropna() (which removes NA values) and fillna() (which fills in NA values).
For a Series, the result is straightforward:
In[16]: data.dropna()
Out[16]: 0 1
2 hello
dtype: object
For a DataFrame, there are more options. Consider the following DataFrame:
In[17]: df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
Out[17]: 0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
We cannot drop single values from a DataFrame; we can only drop full rows or full
columns.
dropna() will drop all rows in which any null value is present:
In[18]: df.dropna()
Out[18]: 0 1 2
1 2.0 3.0 5
drop NA values along a different axis; axis=1 drops all columns containing a null
value:
In[19]: df.dropna(axis='columns')
Out[19]: 2
0 2
1 5
2 6
But this drops some good data as well; you might rather be interested in dropping rows
or columns with all NA values, or a majority of NA values.
how or thresh parameters, allow fine control of the number of nulls to allow.
how='any', allows row or column (depending on the axis key word) containing a
null value will be dropped.
how='all', which will only drop rows/columns that are all null values:
In[20]: df[3] = np.nan
df
Out[20]: 0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In[21]: df.dropna(axis='columns', how='all')
Out[21]: 0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
thresh parameter specify a minimum number of non-null values for the row/column
to be kept:
In[22]: df.dropna(axis='rows', thresh=3)
Out[22]: 0 1 2 3
1 2.0 3.0 5 NaN
Here the first and last row have been dropped, because they contain only two non-null
values.
For DataFrames, the options are similar, but we can also specify an axis along
which the fills take place:
In[27]: df
Out[27]: 0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In[28]: df.fillna(method='ffill', axis=1)
Out[28]: 0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
Notice that if a previous value is not available during a forward fill, the NA
value remains.
Hierarchical Indexing
Create a pandas.Series object with a MultiIndex. Let's break it down:
1. Data: Randomly generated numbers using np.random.uniform(size=9),
resulting in 9 floating-point values.
2. Index: A MultiIndex with two levels:
o The first level has the values: ["a", "a", "a", "b", "b", "c", "c", "d",
"d"].
o The second level has the values: [1, 2, 3, 1, 3, 1, 2, 2, 3].
data =
pd.Series(np.random.uniform(size=9),index=[["a",
"a", "a", "b", "b", "c", "c", "d", "d"],
[1, 2, 3, 1, 3, 1, 2, 2, 3]])
a 1 0.412178
2 0.158417
3 0.111628
b 1 0.394864
3 0.001674
c 1 0.498256
2 0.936531
d 2 0.096640
3 0.746740
dtype: float64
data.index
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
A hierarchically indexed object (partial indexing) is used to select subsets of the
data:
data["b"]
1 0.963700
3 0.042998
dtype: float64
data["b":"c"]
b 1 0.963700
3 0.042998
c 1 0.111421
2 0.131365
dtype: float64
data.loc[["b", "d"]]
b 1 0.963700
3 0.042998
d 2 0.943148
3 0.744674
dtype: float64
data.loc[:, 2]
a 0.713453
c 0.131365
d 0.943148
dtype: float64
The hierarchical levels can have names as strings or any Python objects
frame.index.nlevels
2
With partial column indexing you can similarly select groups of columns:
frame["Ohio"]
frame.swaplevel(0, 1).sort_index(level=0)
frame.groupby(level="color", axis="columns").sum()
Indexing with a DataFrame’s columns
frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
"c": ["one", "one", "one", "two", "two",
"two", "two"],
"d": [0, 1, 2, 0, 1, 2, 3]})
DataFrame’s set_index function will create a new DataFrame using one or more of
its columns as the index:
reset_index, on the other hand, does the opposite of set_index; the hierarchical index
levels are moved into the columns:
frame2.reset_index()