2.3 Operations in Pandas
One of the strengths of NumPy is that it allows us to perform quick element-wise operations, both with basic
arithmetic (addition, subtraction, multiplication, etc.) and with more complicated operations (trigonometric
functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy, and NumPy's universal functions (ufuncs) are key to this.
Pandas includes a couple of useful twists, however:
For unary operations like negation and trigonometric functions, these ufuncs will preserve index
and column labels in the output.
For binary operations such as addition and multiplication, Pandas will automatically align
indices when passing the objects to the ufunc.
This means that keeping the context of data and combining data from different sources (both potentially
error-prone tasks with raw NumPy arrays) become essentially foolproof with Pandas.
We will additionally see that there are well-defined operations between one-dimensional Series structures and
two-dimensional DataFrame structures.
0 4 8 0 6
1 2 0 5 9
2 7 7 7 7
If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices
preserved:
In [4]:
np.exp(ser)
Out[4]:
0 1.000000
1 1096.633158
2 403.428793
3 54.598150
dtype: float64
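As a minimal sketch of label preservation, the snippet below rebuilds small objects and checks that the labels survive a ufunc. The Series values are inferred from the exponentials shown above; the string index labels and the column names A to D are hypothetical and chosen only to make the preserved labels easy to see.

```python
import numpy as np
import pandas as pd

# Values inferred from the output above (e**0, e**7, e**6, e**4);
# the string labels are hypothetical.
ser = pd.Series([0, 7, 6, 4], index=['w', 'x', 'y', 'z'])

# Cell values as displayed earlier; the column names are an assumption.
df = pd.DataFrame([[4, 8, 0, 6],
                   [2, 0, 5, 9],
                   [7, 7, 7, 7]], columns=['A', 'B', 'C', 'D'])

exp_ser = np.exp(ser)            # unary ufunc applied to a Series
trig = np.sin(df * np.pi / 4)    # ufunc applied to a DataFrame

# Both results are Pandas objects that keep the labels of their inputs.
print(exp_ser.index.equals(ser.index))
print(list(trig.columns), list(trig.index))
```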
1  1.000000e+00  0.000000e+00 -0.707107  0.707107
2 -7.071068e-01 -7.071068e-01 -0.707107 -0.707107
In [7]:
population / area
Out[7]:
Alaska NaN
California 93.257784
Florida NaN
Texas 41.896072
dtype: float64
The resulting array contains the union of indices of the two input arrays, which could be determined directly from these
indices:
In [8]:
area.index.union(population.index)
Out[8]:
Index(['Alaska', 'California', 'Florida', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how
Pandas marks missing data.
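The alignment behavior can be reproduced with a small self-contained sketch. The numbers below are illustrative placeholders, not the figures used in the text; only the state labels mirror the output above.

```python
import pandas as pd

# Placeholder values chosen for easy arithmetic (not real statistics).
area = pd.Series({'Alaska': 100.0, 'Texas': 50.0, 'California': 40.0})
population = pd.Series({'California': 400.0, 'Texas': 200.0, 'Florida': 300.0})

density = population / area

# The result is indexed by the union of both indices; labels present in
# only one of the inputs come out as NaN.
print(density)
print(density.index.equals(area.index.union(population.index)))
```

Only California and Texas appear in both inputs, so only those two entries receive a numeric value.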
Index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are
marked by NaN:
In [9]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
Out[9]:
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place
of the operators. For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification
of the fill value for any elements in A or B that might be missing:
In [10]:
A.add(B, fill_value=0)
Out[10]:
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
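One detail worth checking is what happens when a value is missing on both sides. The sketch below uses a variant of the two Series above with explicit NaN entries (an assumption made for illustration): the fill value is applied where exactly one operand is missing, while a position missing in both stays NaN.

```python
import numpy as np
import pandas as pd

# Variant of A and B with a NaN placed at label 1 in both Series.
A = pd.Series([2, np.nan, 6], index=[0, 1, 2])
B = pd.Series([np.nan, 3, 5], index=[1, 2, 3])

filled = A.add(B, fill_value=0)

# label 0: only A has a value  -> 2 + 0
# label 1: missing in both     -> NaN
# label 2: both present        -> 6 + 3
# label 3: only B has a value  -> 0 + 5
print(filled)
```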
a b
0 10 2
1 16 9
In [12]:
B = pd.DataFrame(rng.integers(0, 10, (3, 3)),
columns=['b', 'a', 'c'])
B
Out[12]:
   b  a  c
0  5  3  1
1  9  7  6
2  4  8  5
In [13]:
A + B
Out[13]:
      a     b   c
0  13.0   7.0 NaN
1  23.0  18.0 NaN
2   NaN   NaN NaN
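The two-dimensional case can be checked with a self-contained sketch. The frames are rebuilt from the values displayed above rather than drawn from a random generator, so the result does not depend on a seed. Note that alignment happens on both rows and columns, regardless of the column order in either input.

```python
import pandas as pd

# Values copied from the displayed frames; B lists its columns as b, a, c.
A = pd.DataFrame([[10, 2], [16, 9]], columns=['a', 'b'])
B = pd.DataFrame([[5, 3, 1], [9, 7, 6], [4, 8, 5]], columns=['b', 'a', 'c'])

total = A + B                     # NaN wherever A or B lacks the cell
filled = A.add(B, fill_value=0)   # treat a cell missing on one side as 0

print(total)
print(filled)
```

Column `c` and row `2` exist only in `B`, so they are entirely NaN in `total`, while `filled` carries over the values from `B` there.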
Python operator    Pandas method(s)
+                  add
-                  sub, subtract
*                  mul, multiply
//                 floordiv
%                  mod
**                 pow
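A quick way to confirm the operator and method pairs is to compare their results directly. The sketch below does this for two small Series with made-up values; it also includes `truediv` (with its alias `divide`) for the `/` operator, which is not listed in the table above but is a standard Pandas method.

```python
import operator
import pandas as pd

x = pd.Series([7, 8, 9])
y = pd.Series([2, 3, 4])

# Map each Pandas method name to the corresponding Python operator.
pairs = {
    'add': operator.add,
    'sub': operator.sub,
    'mul': operator.mul,
    'truediv': operator.truediv,
    'floordiv': operator.floordiv,
    'mod': operator.mod,
    'pow': operator.pow,
}

# For each pair, check that the method and the operator agree.
checks = {name: getattr(x, name)(y).equals(op(x, y))
          for name, op in pairs.items()}
print(checks)

# The longer aliases behave the same way as the short names.
print(x.subtract(y).equals(x - y), x.multiply(y).equals(x * y))
```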
When performing operations between a DataFrame and a Series, the index and column alignment is similarly
maintained, and the result is similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:
In [15]:
A = rng.integers(10, size=(3, 4))
A
Out[15]:
array([[4, 4, 2, 0],
       [5, 8, 0, 8],
       [8, 2, 6, 1]])
In [16]:
A - A[0]
Out[16]:
array([[ 0, 0, 0, 0],
[ 1, 4, -2, 8],
[ 4, -2, 4, 1]])
According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied
row-wise.
0  0  0  0  0
1  1  4 -2  8
2  4 -2  4  1
If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying
the axis keyword:
In [18]:
df.subtract(df['R'], axis=0)
Out[18]:
Q R S T
0 0 0 -2 -4
1 -3 0 -8 0
2 6 0 4 -1
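To compare the two directions side by side, here is a self-contained sketch. It rebuilds the frame from the array values shown earlier, with the column names Q, R, S, T taken from the output above, instead of relying on the random generator.

```python
import pandas as pd

# Same values as the array A displayed earlier, labeled Q, R, S, T.
df = pd.DataFrame([[4, 4, 2, 0],
                   [5, 8, 0, 8],
                   [8, 2, 6, 1]], columns=list('QRST'))

# Default: the Series index is matched against the columns, so the
# first row is subtracted from every row.
by_row = df - df.iloc[0]

# axis=0: the Series index is matched against the row index, so column
# R is subtracted from every column.
by_col = df.subtract(df['R'], axis=0)

print(by_row)
print(by_col)
```

In `by_row` the first row becomes all zeros, while in `by_col` the `R` column becomes all zeros.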
Note that these DataFrame/Series operations, like the operations discussed previously, will automatically align indices
between the two elements:
In [19]:
halfrow = df.iloc[0, ::2]
halfrow
Out[19]:
Q 4
S 2
Name: 0, dtype: int64
In [20]:
df - halfrow
Out[20]:
Q R S T