Unit III - Pandas - Data Manipulation Using Python
Pandas
Three fundamental Pandas data structures: the Series, DataFrame, and Index.
import numpy as np
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
data.index
RangeIndex(start=0, stop=4, step=1)
data[1]
0.5
data[1:3]
1 0.50
2 0.75
dtype: float64
Series as generalized NumPy array
We can use strings as an index:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
0.5
Series as specialized dictionary
A Series can use a non-contiguous or non-sequential index:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data[5]
0.5
population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population['California']
38332521
population['California':'Illinois']
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
Constructing Series objects
The general form is pd.Series(data, index=index), where index is an optional argument
and data can be one of many entities.
pd.Series([2, 4, 6])
0 2
1 4
2 6
dtype: int64
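For instance, data can be a scalar (repeated across the index) or a dictionary (whose keys become the index); a short sketch with illustrative values:

```python
import pandas as pd

# A scalar is broadcast to fill the specified index
s1 = pd.Series(5, index=[100, 200, 300])

# A dictionary's keys become the index
s2 = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# An explicit index selects (and orders) a subset of the dict's keys
s3 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(s1)
print(s2)
print(s3)
```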
A DataFrame is an analog of a two-dimensional array with both flexible row indices and
flexible column names.
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})
states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
states.columns
Index(['area', 'population'], dtype='object')
pd.DataFrame(population, columns=['population'])
            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
From a list of dicts
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
a b
0 0 0
1 1 2
2 2 4
From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
        foo       bar
a  0.865257  0.213169
b  0.442759  0.108267
c  0.047110  0.905718
From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
Index as immutable array
ind = pd.Index([2, 3, 5, 7, 11])
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
One difference between Index objects and NumPy arrays is that indices are immutable–
that is, they cannot be modified via the normal means:
ind[1] = 0   # raises TypeError: Index does not support mutable operations
Index as ordered set
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Int64Index([3, 5, 7], dtype='int64')
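Note that in recent pandas versions the &, |, and ^ operators on Index objects are deprecated or removed; the method spellings below are the stable way to write these set operations:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Method-based set operations on Index objects
print(indA.intersection(indB))          # elements in both
print(indA.union(indB))                 # elements in either
print(indA.symmetric_difference(indB))  # elements in exactly one
```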
First, the loc attribute allows indexing and slicing that always references the explicit
index:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data.loc[1]
'a'
data.loc[1:3]
1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-
style index:
data.iloc[1]
'b'
data.iloc[1:3]
3 b
5 c
dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to
standard []-based indexing. The purpose of the ix indexer becomes more apparent in the
context of DataFrame objects. (Note: ix has been deprecated and removed in recent
versions of pandas; use loc and iloc instead.)
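To contrast loc and iloc on a DataFrame, here is a small sketch (this two-state table is illustrative, not from the text):

```python
import pandas as pd

# Hypothetical example DataFrame to contrast explicit vs. implicit indexing
data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# loc references the explicit row and column labels
print(data.loc['California', 'pop'])   # 38332521

# iloc references implicit Python-style integer positions
print(data.iloc[0, 1])                 # 38332521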
One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with
more sophisticated operations (trigonometric functions, exponential and logarithmic
functions, etc.).
Pandas inherits much of this functionality from NumPy through its ufuncs.
i. For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
ii. For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
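For example, dividing two Series with different indices aligns them on the union of labels, marking entries missing from either input as NaN (a small sketch with illustrative state figures):

```python
import pandas as pd
import numpy as np

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})

# Indices are aligned on the union of labels; unmatched entries become NaN
density = population / area
print(density)
```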
In particular, many interesting datasets will have some amount of data missing.
Here we will see some general considerations for missing data, and how Pandas chooses
to represent it.
Demonstrate some built-in Pandas tools for handling missing data in Python.
We refer to missing data in general as null, NaN, or NA values.
There are a number of schemes that have been developed to indicate the presence of
missing data in a table or DataFrame.
There are two common strategies: using a mask that globally indicates missing values, or
choosing a sentinel value (e.g., -9999, NaN, or None) that indicates a missing entry.
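Pandas uses the sentinel approach: NaN serves as the floating-point sentinel, and None is converted to NaN when it appears in a numeric Series. A small sketch:

```python
import pandas as pd
import numpy as np

# None is upcast to NaN in a numeric Series, so the dtype becomes float64
s = pd.Series([1, np.nan, 2, None])
print(s)
print(s.isnull())
```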
To facilitate this convention, there are several useful methods for detecting, removing,
and replacing null values in Pandas data structures. They are:
isnull(): Generate a boolean mask indicating missing values
notnull(): Opposite of isnull()
dropna(): Return a filtered version of the data
fillna(): Return a copy of the data with missing values filled or imputed
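A short sketch of these four methods in action (the values are illustrative):

```python
import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])

print(data.isnull())          # boolean mask marking the missing entries
print(data[data.notnull()])   # select only the valid entries
print(data.dropna())          # drop the missing entries
print(data.fillna(0))         # replace missing entries with 0
```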
Hierarchical Indexing
Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels
within a single index. In this way, higher-dimensional data can be compactly represented
within the familiar one-dimensional Series and two-dimensional DataFrame objects.
Pandas MultiIndex
import pandas as pd
import numpy as np
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000) 33871648
(California, 2010) 37253956
(New York, 2000) 18976457
(New York, 2010) 19378102
(Texas, 2000) 20851820
(Texas, 2010) 25145561
dtype: int64
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
The MultiIndex contains multiple levels of indexing–in this case, the state names and the
years, as well as multiple labels for each data point which encode these levels.
If we re-index our series with this MultiIndex, we see the hierarchical representation of
the data:
pop = pop.reindex(index)
pop
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
pop[:, 2010]
California 37253956
New York 19378102
Texas 25145561
dtype: int64
pop_df = pop.unstack()
pop_df
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
The stack() method provides the opposite operation:
pop_df.stack()
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
We can add another column to the DataFrame:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
pop.index.names = ['state', 'year']
pop.loc['California':'New York']
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
Combining Datasets:
Concat and Append
import pandas as pd
import numpy as np
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1)
array([[1, 2, 1, 2],
[3, 4, 3, 4]])
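pd.concat() performs the same operation for Pandas objects; a minimal sketch (these Series and DataFrames are illustrative):

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

# pd.concat works like np.concatenate, but preserves the indices
print(pd.concat([ser1, ser2]))

df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'y': [3, 4]})

# axis=1 concatenates column-wise instead of row-wise
wide = pd.concat([df1, df2], axis=1)
print(wide)
```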
Given two DataFrames df5 and df6 whose columns only partially overlap, passing
join='inner' keeps just the shared columns:
pd.concat([df5, df6], join='inner')
Categories of Joins
The pd.merge() function implements a number of types of joins: one-to-one, many-to-one,
and many-to-many joins.
All three types of joins are accessed via an identical call to the pd.merge() interface; the
type of join performed depends on the form of the input data.
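A brief sketch of the first two join types, using hypothetical employee tables (not from the text):

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# One-to-one join: each 'employee' key appears exactly once in both tables
df3 = pd.merge(df1, df2)
print(df3)

# Many-to-one join: 'group' repeats in df3 but is unique in df4,
# so each supervisor is duplicated across the matching rows
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
merged = pd.merge(df3, df4)
print(merged)
```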