
Unit III - Pandas - Data Manipulation Using Python

Uploaded by

SARAVANAN
Copyright
© All Rights Reserved

Unit - III

Pandas

Introducing Pandas Objects


 Pandas objects are enhanced versions of NumPy structured arrays in which the rows and
columns are identified with labels rather than simple integer indices.

 Three fundamental Pandas data structures: the Series, DataFrame, and Index.

import numpy as np
import pandas as pd

The Pandas Series Object


 A Pandas Series is a one-dimensional array of indexed data.

 It can be created from a list or array as follows:

data = pd.Series([0.25, 0.5, 0.75, 1.0])


data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64

data.values
array([ 0.25, 0.5 , 0.75, 1. ])

data.index
RangeIndex(start=0, stop=4, step=1)

data[1]
0.5

data[1:3]
1 0.50
2 0.75
dtype: float64

Series as generalized NumPy array

 We can use strings as an index:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64

data['b']
0.5

 We can even use non-contiguous or non-sequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data
2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64

data[5]
0.5

Series as specialized dictionary


population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64

population['California']
38332521

population['California':'Illinois']
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
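
The label slice above depends on the index being in alphabetical order. Recent pandas versions appear to keep a dictionary's insertion order when building a Series (an assumption about the installed version, not something these notes state), so a minimal sketch that reproduces the same slice sorts the index first:

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# Sort the labels so that a label-based slice covers an alphabetical range.
population = pd.Series(population_dict).sort_index()

# Label-based slices include both endpoints.
subset = population['California':'Illinois']
print(subset)
```

Unlike a plain dictionary, the Series supports this array-style slicing on its keys.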

Constructing Series objects

pd.Series(data, index=index)

 Here index is an optional argument, and data can be one of many entities (a list or
NumPy array, a scalar, or a dictionary).

pd.Series([2, 4, 6])
0 2
1 4
2 6
dtype: int64

pd.Series(5, index=[100, 200, 300])


100 5
200 5
300 5
dtype: int64

pd.Series({2:'a', 1:'b', 3:'c'})


1 b
2 a
3 c
dtype: object

pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])


3 c
2 a
dtype: object

The Pandas DataFrame Object


 If a Series is an analog of a one-dimensional array with flexible indices, a
DataFrame is an analog of a two-dimensional array with both flexible row indices and
flexible column names.

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64

states = pd.DataFrame({'population': population,
                       'area': area})
states
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

states.columns
Index(['area', 'population'], dtype='object')

DataFrame as specialized dictionary


states['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64

Constructing DataFrame objects


 A Pandas DataFrame can be constructed in a variety of ways.

From a single Series object


 A DataFrame is a collection of Series objects, and a single-column DataFrame can be
constructed from a single Series:

pd.DataFrame(population, columns=['population'])
            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193

From a list of dicts

data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4

From a dictionary of Series objects


pd.DataFrame({'population': population,
              'area': area})
              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

From a two-dimensional NumPy array


pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
        foo       bar
a  0.865257  0.213169
b  0.442759  0.108267
c  0.047110  0.905718

From a NumPy structured array


A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
array([(0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('A', '<i8'), ('B', '<f8')])

pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0

The Pandas Index Object


 Both the Series and DataFrame objects contain an explicit index that lets you reference
and modify data.

 The Index object is an interesting structure in itself, and it can be thought of either as
an immutable array or as an ordered set.

ind = pd.Index([2, 3, 5, 7, 11])


ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')

Index as immutable array


ind[1]
3

ind[::2]
Int64Index([2, 5, 11], dtype='int64')

print(ind.size, ind.shape, ind.ndim, ind.dtype)


5 (5,) 1 int64

 One difference between Index objects and NumPy arrays is that indices are immutable;
that is, they cannot be modified via the normal means:

ind[1] = 0
TypeError: Index does not support mutable operations

Index as ordered set

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Int64Index([3, 5, 7], dtype='int64')

indA | indB # union


Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

indA ^ indB # symmetric difference


Int64Index([1, 2, 9, 11], dtype='int64')
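
The same set operations are also available as methods on Index. This is a minimal sketch, offered on the assumption that the installed pandas may treat the &, | and ^ operators as element-wise logical operations rather than set operations (as newer releases do); the method names are part of the pandas Index API, while the result variable names are only illustrative:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # values present in both
combined = indA.union(indB)                  # values present in either
only_one = indA.symmetric_difference(indB)   # values present in exactly one

print(list(common), list(combined), list(only_one))
```

The method forms state the intended set operation explicitly and do not depend on how a particular version interprets the operators.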

Data Selection in Series


Indexers: loc, iloc, and ix

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])


data
1 a
3 b
5 c
dtype: object

# explicit index when indexing
data[1]
'a'

# implicit index when slicing
data[1:3]
3 b
5 c
dtype: object

 First, the loc attribute allows indexing and slicing that always references the explicit
index:

data.loc[1]
'a'

data.loc[1:3]
1 a
3 b
dtype: object

 The iloc attribute allows indexing and slicing that always references the implicit
Python-style index:

data.iloc[1]
'b'

data.iloc[1:3]
3 b
5 c
dtype: object

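 A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to
standard []-based indexing. Note that ix was deprecated in pandas 0.20 and removed in
pandas 1.0, so it is only available in older installations.

 The purpose of the ix indexer becomes more apparent in the context of DataFrame
objects, where it allowed label-based and position-based selection to be mixed.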
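
As a sketch of how the two explicit indexers carry over to a DataFrame, the example below uses a shortened copy of the states table (three rows, an assumption made to keep the block self-contained). It also shows one way to combine positional row selection with a column label using only loc, which may be useful where ix is unavailable:

```python
import pandas as pd

states = pd.DataFrame(
    {'area': [423967, 170312, 149995],
     'population': [38332521, 19552860, 12882135]},
    index=['California', 'Florida', 'Illinois'])

# loc: explicit labels for rows and columns (the slice end is included)
by_label = states.loc['California':'Florida', 'area']

# iloc: implicit integer positions (the slice end is excluded)
by_position = states.iloc[0:2, 0]

# Hybrid selection without ix: turn row positions into labels first
mixed = states.loc[states.index[0:2], 'area']

print(by_label)
```

All three expressions select the same two values; they differ only in how the rows and the column are addressed.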

Operating on Data in Pandas

 One of the essential pieces of NumPy is the ability to perform quick element-wise
operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with
more sophisticated operations (trigonometric functions, exponential and logarithmic
functions, etc.).

 Pandas inherits much of this functionality from NumPy, and universal functions (ufuncs)
are the key to it.

 Pandas includes a couple of useful twists, however:

i. For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
ii. For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
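
Both behaviours can be seen in a small example. This is a minimal sketch with made-up values; the Series names ser, A and B are illustrative only:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

# Unary ufunc: the result is still a Series carrying the same index labels.
exp_ser = np.exp(ser)

# Binary operation: values are matched by label, not by position.
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
total = A + B  # labels found in only one operand produce NaN

print(exp_ser)
print(total)
```

The result of A + B is indexed by the union of the two indices, which is why labels 0 and 3 appear with missing values.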

Handling Missing Data


 The difference between data found in many tutorials and data in the real world is that
real-world data is rarely clean and homogeneous.

 In particular, many interesting datasets will have some amount of data missing.

 Different data sources may indicate missing data in different ways.

 Here we will see some general considerations for missing data and how Pandas chooses
to represent it, and demonstrate some built-in Pandas tools for handling missing data
in Python.

 We refer to missing data in general as null, NaN, or NA values.

 There are a number of schemes that have been developed to indicate the presence of
missing data in a table or DataFrame.

 These schemes generally follow one of two strategies: using a mask that globally
indicates missing values, or choosing a sentinel value that indicates a missing entry
(for example, -9999 for a missing integer, NaN for a missing floating-point value, or
None for a missing Python object).
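
The two strategies can be illustrated with plain NumPy. This is a minimal sketch; the readings and the choice of -9999 as the sentinel are hypothetical:

```python
import numpy as np

# Raw integer readings in which -9999 is a (hypothetical) sentinel for "missing".
raw = np.array([12, -9999, 7, -9999])

# Strategy 1: a separate Boolean mask marks the missing entries.
mask = raw == -9999

# Strategy 2: a sentinel stored in the data itself. NaN is a floating-point
# value, so the array has to be converted to float before NaN can be stored.
values = raw.astype(float)
values[mask] = np.nan

print(mask)
print(values)
print(np.nanmean(values))  # mean of the non-missing values
```

The mask costs an extra array, while the sentinel reduces the range of values that can be represented and, for NaN, forces a floating-point type.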

Operating on Null Values


 Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values.

 To facilitate this convention, there are several useful methods for detecting, removing,
and replacing null values in Pandas data structures. They are:
 isnull(): Generate a boolean mask indicating missing values
 notnull(): Opposite of isnull()
 dropna(): Return a filtered version of the data
 fillna(): Return a copy of the data with missing values filled or imputed
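
The four methods can be tried on a small Series. This is a minimal sketch with made-up values; filling with 0 is just one possible choice:

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the other values are numeric.
data = pd.Series([1.0, np.nan, 3.0, None])

missing = data.isnull()    # True where a value is missing
present = data.notnull()   # the inverse mask
cleaned = data.dropna()    # keeps only the non-null entries
filled = data.fillna(0)    # copy with every null replaced by 0

print(missing.tolist())
print(cleaned)
print(filled)
```

None of these calls modifies data itself; each returns a new object.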

Hierarchical Indexing

 Older versions of Pandas provided dedicated objects (Panel and Panel4D) that handle
three-dimensional and four-dimensional data; current releases no longer include them.

 A far more common pattern in practice is to make use of hierarchical indexing (also
known as multi-indexing) to incorporate multiple index levels within a single index.

 In this way, higher-dimensional data can be compactly represented within the familiar
one-dimensional Series and two-dimensional DataFrame objects.

Pandas MultiIndex

import pandas as pd
import numpy as np

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000) 33871648
(California, 2010) 37253956
(New York, 2000) 18976457
(New York, 2010) 19378102
(Texas, 2000) 20851820
(Texas, 2010) 25145561
dtype: int64

index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

 The MultiIndex contains multiple levels of indexing (in this case, the state names and
the years), as well as multiple labels for each data point which encode these levels.

 If we re-index our series with this MultiIndex, we see the hierarchical representation of
the data:

pop = pop.reindex(index)
pop
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

pop[:, 2010]
California 37253956
New York 19378102
Texas 25145561
dtype: int64

MultiIndex as extra dimension


 The unstack() method will quickly convert a multiply indexed Series into a
conventionally indexed DataFrame:

pop_df = pop.unstack()
pop_df
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561

 The stack() method provides the opposite operation:

pop_df.stack()
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014

Methods of MultiIndex Creation


 The most straightforward way to construct a multiply indexed Series or DataFrame is to
simply pass a list of two or more index arrays to the constructor.

df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688

 Similarly, if a dictionary with tuples as keys is passed, Pandas uses a MultiIndex
by default:

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Explicit MultiIndex constructors

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])


MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])


MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex level names


 Sometimes it is convenient to name the levels of the MultiIndex. This can be
accomplished by passing the names argument to any of the above MultiIndex
constructors, or by setting the names attribute of the index after the fact:

pop.index.names = ['state', 'year']
pop
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
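
The level names can also be supplied when the index is built. This is a minimal sketch that passes names to MultiIndex.from_tuples, reusing the population figures from earlier in these notes:

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# The names argument labels the two levels at construction time.
index = pd.MultiIndex.from_tuples(tuples, names=['state', 'year'])
pop = pd.Series(populations, index=index)

print(pop.index.names)
print(pop)
```

Named levels make it easier to keep track of what each level represents as more datasets are combined.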

Indexing and Slicing a MultiIndex


pop
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

pop.loc['California':'New York']
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64
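
Besides slicing on the outer level, a multiply indexed Series can be addressed by a full key or by one level only. This is a minimal sketch that rebuilds pop so it runs on its own; the variable names single, by_state and by_year are illustrative:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Full key: one label per level returns a single value.
single = pop.loc[('California', 2000)]

# Partial key on the outer level: returns a Series indexed by year.
by_state = pop.loc['California']

# Empty slice on the outer level, label on the inner one: indexed by state.
by_year = pop.loc[:, 2000]

print(single)
print(by_state)
print(by_year)
```

Partial selection drops the level that was fixed, so the results are ordinary singly indexed Series.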

Combining Datasets: Concat and Append

import pandas as pd
import numpy as np

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

Simple Concatenation with pd.concat


ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")
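
The helpers make_df and display are not defined in these notes. The following is a minimal sketch of what make_df might look like, assuming it fills each cell with the column name followed by the row label; display is assumed to show the named objects side by side, so plain print calls stand in for it here:

```python
import pandas as pd

def make_df(cols, ind):
    """Hypothetical helper: cell values are column name + row label, e.g. 'A1'."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
stacked = pd.concat([df1, df2])                        # rows 1, 2, 3, 4

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
side_by_side = pd.concat([df3, df4], axis='columns')   # columns A, B, C, D

print(stacked)
print(side_by_side)
```

By default pd.concat stacks the inputs row-wise; axis='columns' places them next to each other and aligns them on the row index.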

Concatenation with joins


df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
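
The default behaviour and join='inner' can be compared directly. The following is a minimal sketch that builds df5 and df6 by hand, assuming make_df fills each cell with the column name followed by the row label:

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Default join='outer': union of the column names, gaps filled with NaN.
outer = pd.concat([df5, df6])

# join='inner': only the columns common to both inputs are kept.
inner = pd.concat([df5, df6], join='inner')

print(outer)
print(inner)
```

The outer result has columns A to D with missing entries where an input lacked a column, while the inner result keeps only B and C.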

display('df5', 'df6',
"pd.concat([df5, df6], join='inner')")

The append() method


display('df1', 'df2', 'df1.append(df2)')
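
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above only runs on older installations. A minimal sketch of the equivalent operation with pd.concat, building df1 and df2 by hand on the assumption that their cells hold the column name followed by the row label:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# Like append(), concat returns a new DataFrame and leaves the inputs unchanged.
combined = pd.concat([df1, df2])

print(combined)
```

Because each call creates a new object, combining many DataFrames is usually better done with one pd.concat call on a list than with repeated pairwise combinations.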

Merge and Join

Categories of Joins

 The pd.merge() function implements a number of types of joins: the one-to-one,
many-to-one, and many-to-many joins.

 All three types of joins are accessed via an identical call to the pd.merge() interface; the
type of join performed depends on the form of the input data.
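
The three categories can be seen by calling pd.merge() on inputs whose key columns differ in how often values repeat. This is a minimal sketch; all table contents and names below are made-up example data:

```python
import pandas as pd

employees = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
hire_dates = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'hire_date': [2004, 2008, 2012, 2014]})
supervisors = pd.DataFrame({
    'group': ['Accounting', 'Engineering', 'HR'],
    'supervisor': ['Carly', 'Guido', 'Steve']})
skills = pd.DataFrame({
    'group': ['Accounting', 'Accounting', 'Engineering',
              'Engineering', 'HR', 'HR'],
    'skills': ['math', 'spreadsheets', 'coding',
               'linux', 'spreadsheets', 'organization']})

# One-to-one: 'employee' is unique in both inputs.
one_to_one = pd.merge(employees, hire_dates)

# Many-to-one: 'group' repeats on the left but is unique on the right.
many_to_one = pd.merge(employees, supervisors)

# Many-to-many: 'group' repeats in both inputs.
many_to_many = pd.merge(employees, skills)

print(len(one_to_one), len(many_to_one), len(many_to_many))
```

The call is identical in each case; pd.merge() joins on the shared column name, and the number of output rows follows from how often each key value occurs in the two inputs.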
