DSP Unit-5 Updated
Hierarchical Indexing
import pandas as pd
import numpy as np
df = pd.Series(np.random.randn(9),
               index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                      [1, 2, 3, 1, 3, 1, 2, 2, 3]])
print(df)
Output:
a 1 0.510547
2 0.099520
3 -0.511527
b 1 -0.393166
3 0.807105
c 1 -1.297928
2 0.658603
d 2 1.371548
3 -1.245286
dtype: float64
print(df.index)
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
print(df['b'])
1 0.660261
3 1.552526
dtype: float64
print(df['b':'c'])
b 1 0.010531
3 -0.976936
c 1 -0.317225
2 1.423272
dtype: float64
print(df.loc[['b','d']])
b 1 0.277812
3 1.540045
d 2 1.583744
3 1.448799
dtype: float64
print(df.loc[:, 2])
a -0.329168
c -0.015847
d -0.115682
dtype: float64
df.unstack()
          1         2         3
a -0.832795 -1.697236  0.321600
b  0.060409       NaN  1.461795
c -0.743408 -0.573688       NaN
d       NaN -1.355128 -0.519725
print(df.unstack().stack())
a 1 0.510547
2 0.099520
3 -0.511527
b 1 -0.393166
3 0.807105
c 1 -1.297928
2 0.658603
d 2 1.371548
3 -1.245286
dtype: float64
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Mango', 'Apple', 'Mango'], ['Green', 'Red', 'Green']])
print(df)
The hierarchical levels can have names (as strings or any Python objects). If so, these will
show up in the console output:
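The naming step can be sketched as follows. This is a minimal example, not the original code of these notes: the index names 'key1' and 'key2' match the names used later in this unit, while the column level names 'fruit' and 'color' and the slightly changed column labels (chosen so that every column is unique) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the frame above; column labels are assumed and kept unique.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Apple', 'Mango', 'Mango'], ['Red', 'Green', 'Red']],
)

# Assumed level names; they appear in the printed output once set.
frame.index.names = ['key1', 'key2']
frame.columns.names = ['fruit', 'color']

print(frame)
print(frame['Mango'])  # partial column indexing selects the 'Mango' group
```

With the names in place, levels can be referred to by name instead of by position in methods such as swaplevel and sort_index.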
print(df['Mango'])
At times you will need to rearrange the order of the levels on an axis or sort the data by
the values in one specific level. The swaplevel method takes two level numbers or names and
returns a new object with the levels interchanged (the data is otherwise unaltered). Here the
index levels are assumed to be named key1 and key2:
print(df.swaplevel('key1', 'key2'))
sort_index, on the other hand, sorts the data using only the values in a single level.
print(df.sort_index(level=1))
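A small self-contained sketch of both methods is given here. The frame, its column labels 'x', 'y', 'z' and the variable names are hypothetical; only swaplevel and sort_index themselves come from the notes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']
    ),
    columns=['x', 'y', 'z'],  # hypothetical column labels
)

swapped = frame.swaplevel('key1', 'key2')  # levels exchanged, row order unchanged
by_key2 = frame.sort_index(level=1)        # rows ordered by the values of key2
both = swapped.sort_index(level=0)         # after the swap, key2 is level 0

print(swapped)
print(by_key2)
print(both)
```

Note that swaplevel alone does not reorder the rows; combining it with sort_index gives a frame that is both re-leveled and sorted.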
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
DataFrame’s set_index function will create a new DataFrame using one or more of its
columns as the index:
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default the columns are removed from the DataFrame, though you can leave them in:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
reset_index, on the other hand, does the opposite of set_index; the hierarchical index
levels are moved into the columns:
print(df1.reset_index())
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
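The set_index and reset_index steps above can be reproduced with the sketch below. The frame is rebuilt from the printed output; the variable names frame, indexed, kept and restored are assumptions, not names from the original code.

```python
import pandas as pd

# Frame reconstructed from the output shown above.
frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(['c', 'd'])           # 'c' and 'd' become index levels
kept = frame.set_index(['c', 'd'], drop=False)  # same index, columns retained
restored = indexed.reset_index()                # index levels moved back to columns

print(indexed)
print(kept)
print(restored)
```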
pandas.merge connects rows in DataFrames based on one or more keys. This will
be familiar to users of SQL or other relational databases, as it implements database
join operations.
pandas.concat concatenates or “stacks” together objects along an axis.
The combine_first instance method enables splicing together overlapping data to
fill in missing values in one object with values from another.
Join Operation
print(df1)
print(df2)
df3 = df1.join(df2)
print(df3)
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
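The definitions of df1 and df2 are not shown above. A minimal sketch that rebuilds them from the printed output and repeats the join:

```python
import pandas as pd

# Frames reconstructed from the output above; both are indexed by key labels.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']},
                   index=['K0', 'K2', 'K3'])

# join aligns on the index; the default how='left' keeps every row of df1.
df3 = df1.join(df2)
print(df3)
```

K1 has no match in df2, so its C and D values are NaN, and K3 is dropped because it does not occur in df1.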
Inner Join
Inner join is the most common type of join you’ll be working
with. It returns a Dataframe with only those rows that have
common characteristics. This is similar to the intersection of
two sets.
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2
Outer Join
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
Left Join
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
Right Join
      A    B   C   D
K0   A0   B0  C0  D0
K2   A2   B2  C2  D2
K3  NaN  NaN  C3  D3
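The four join types can be compared side by side with a short loop. This is a sketch under the assumption that the tables above were produced with join and its how parameter; the frames are rebuilt from the earlier output and the results dictionary is a hypothetical helper.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']},
                   index=['K0', 'K2', 'K3'])

# One result per join type, keyed by the value passed to how.
results = {how: df1.join(df2, how=how)
           for how in ['inner', 'outer', 'left', 'right']}

for how, result in results.items():
    print(how)
    print(result)
```

Inner keeps the keys common to both frames, outer keeps all keys, left keeps the keys of df1 and right keeps the keys of df2.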
Sri Ch, Chandra Sekhar, IT - AITAM Page 8
UNIT - 5
Merging Operation:
The merge() function merges DataFrames using database-style joins: inner join, outer
join, left join and right join.
It combines exactly two DataFrames.
The join is done on columns or indexes.
If joining columns on columns, the DataFrame indexes are ignored.
If joining indexes on indexes, or indexes on a column, the index is passed on.
import pandas as pd
df1 = pd.DataFrame({
'id':[1,2,3,4], 'sub_id':['s1','s2','s4','s6'], 'marks': [55, 77, 88, 66]})
df2 = pd.DataFrame({
'id':[1,2,3,4], 'sub_id':['s2','s4','s3','s6'], 'marks': [60, 40, 50, 70]})
print (df1)
print (df2)
id sub_id marks
0 1 s1 55
1 2 s2 77
2 3 s4 88
3 4 s6 66
id sub_id marks
0 1 s2 60
1 2 s4 40
2 3 s3 50
3 4 s6 70
on:- The column or index level name(s) on which the merge happens. They must be
present in both DataFrames. If on is None and no other key is given, the merge uses
the columns that the two DataFrames have in common.
left_on:- The column or index level name(s) to use as join keys in the first (left)
DataFrame.
right_on:- The column or index level name(s) to use as join keys in the second (right)
DataFrame.
If the column names are different in each object, you can specify them separately:
import pandas as pd
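A minimal sketch of left_on and right_on follows. The frames students and subjects and their column names are hypothetical and chosen only so that the key columns have different names.

```python
import pandas as pd

# Hypothetical frames: the key is 'student_id' on the left and 'id' on the right.
students = pd.DataFrame({'student_id': [1, 2, 3, 4], 'marks': [55, 77, 88, 66]})
subjects = pd.DataFrame({'id': [1, 2, 3, 5], 'sub_id': ['s2', 's4', 's3', 's6']})

# Default inner join: only keys found in both frames survive.
merged = pd.merge(students, subjects, left_on='student_id', right_on='id')
print(merged)
```

Both key columns are kept in the result because their names differ.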
how:- The type of merge to perform. It takes one of the following values (the default is inner):
left: uses only keys from the left frame, similar to a SQL left outer join.
right: uses only keys from the right frame, similar to a SQL right outer join.
outer: uses the union of keys from both frames, similar to a SQL full outer join.
inner: uses the intersection of keys from both frames, similar to a SQL inner join.
import pandas as pd
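The effect of each how value can be seen by merging the two marks frames defined earlier on 'sub_id'. This is a sketch; the choice of 'sub_id' as the key and the results dictionary are assumptions made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'sub_id': ['s1', 's2', 's4', 's6'],
                    'marks': [55, 77, 88, 66]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'sub_id': ['s2', 's4', 's3', 's6'],
                    'marks': [60, 40, 50, 70]})

# One merged frame per join type, all joined on the 'sub_id' column.
results = {how: pd.merge(df1, df2, on='sub_id', how=how)
           for how in ['left', 'right', 'outer', 'inner']}

for how, result in results.items():
    print(how)
    print(result)
```

Because 'id' and 'marks' occur in both frames but are not join keys, they appear twice in each result with the default suffixes.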
suffixes:- A sequence of length two. Each value is a string giving the suffix to
append to overlapping column names from the first and second DataFrame respectively
after the merge. Its default value is ("_x", "_y").
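A short sketch of the suffixes parameter, using the marks frames from earlier in this section. The suffix strings '_left' and '_right' are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'sub_id': ['s1', 's2', 's4', 's6'],
                    'marks': [55, 77, 88, 66]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'sub_id': ['s2', 's4', 's3', 's6'],
                    'marks': [60, 40, 50, 70]})

# 'sub_id' and 'marks' overlap, so they are renamed in the result.
default = pd.merge(df1, df2, on='id')
custom = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

print(default.columns.tolist())
print(custom.columns.tolist())
```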
import pandas as pd
df1 = pd.DataFrame({'key1': ['f1', 'f1', 'b1'], 'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
df2 = pd.DataFrame({'key1': ['f1', 'f1', 'b1', 'b1'], 'key2': ['one', 'one', 'one', 'two'], 'rval':
[4, 5, 6, 7]})
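The merge call for these two frames is not shown. One plausible continuation, given here as an assumption rather than the original code, is to merge on both key columns by passing a list to on:

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['f1', 'f1', 'b1'],
                    'key2': ['one', 'two', 'one'],
                    'lval': [1, 2, 3]})
df2 = pd.DataFrame({'key1': ['f1', 'f1', 'b1', 'b1'],
                    'key2': ['one', 'one', 'one', 'two'],
                    'rval': [4, 5, 6, 7]})

# Rows match only when both key1 and key2 agree.
inner = pd.merge(df1, df2, on=['key1', 'key2'])
outer = pd.merge(df1, df2, on=['key1', 'key2'], how='outer')

print(inner)
print(outer)
```

The pair ('f1', 'one') occurs once on the left and twice on the right, so it produces two rows. The outer result also keeps ('f1', 'two') and ('b1', 'two'), each with one missing value.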
Concatenate Operation:
import pandas as pd
import numpy as np
a = np.arange(12).reshape((3, 4))
print(a)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
print(np.concatenate([a, a], axis=1))
[[ 0 1 2 3 0 1 2 3]
[ 4 5 6 7 4 5 6 7]
[ 8 9 10 11 8 9 10 11]]
In the context of pandas objects such as Series and DataFrame, having labeled axes
enables you to further generalize array concatenation. In particular, you have a number
of additional things to think about:
If the objects are indexed differently on the other axes, should we combine the
distinct elements in these axes or use only the shared values (the intersection)?
Do the concatenated chunks of data need to be identifiable in the resulting object?
Does the “concatenation axis” contain data that needs to be preserved? In many
cases, the default integer labels in a DataFrame are best discarded during
concatenation.
The concat() function in pandas provides a consistent way to address each of these
concerns.
import pandas as pd
#import numpy as np
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
By default concat() works along axis=0, producing another Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1 is the columns):
s4 = pd.concat([s1, s3])
s = pd.concat([s1, s4], axis=1)
print(s)
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
0 1
a 0 0
b 1 1
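The Series used in this part are not defined in the text above. The sketch below rebuilds s1, s2 and s3 from the printed output and shows how concat handles each of the three concerns listed earlier. The names stacked, side_by_side, shared, labeled and renumbered are hypothetical.

```python
import pandas as pd

# Series reconstructed from the printed output.
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

stacked = pd.concat([s1, s2, s3])                   # axis=0: one longer Series
s4 = pd.concat([s1, s3])
side_by_side = pd.concat([s1, s4], axis=1)          # axis=1: DataFrame, union of labels
shared = pd.concat([s1, s4], axis=1, join='inner')  # only labels present in both

# keys= builds a hierarchical index so each chunk stays identifiable.
labeled = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

# ignore_index=True discards the labels on the concatenation axis.
renumbered = pd.concat([s1, s3], ignore_index=True)

print(side_by_side)
print(shared)
print(labeled)
```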
import pandas as pd
import numpy as np
There is another data combination situation that can’t be expressed as either a merge or
concatenation operation. You may have two datasets whose indexes overlap in full or
part. As a motivating example, consider NumPy’s where function, which performs the
array-oriented equivalent of an if-else expression:
import pandas as pd
import numpy as np
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64), index=['f', 'e', 'd', 'c', 'b', 'a'])
print(a)
print(b)
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
dtype: float64
c = b.combine_first(a)
print(c)
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
dtype: float64
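Since b contains no missing values, b.combine_first(a) returns b unchanged. The sketch below shows the NumPy where expression mentioned earlier and the reversed call a.combine_first(b), which fills the gaps in a with values from b. The names filled and patched are hypothetical.

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)

# if-else on arrays: take b where a is missing, otherwise a (plain NumPy array).
filled = np.where(pd.isnull(a), b, a)
print(filled)

# Same idea, but aligned on index labels and returned as a Series.
patched = a.combine_first(b)
print(patched)
```

Unlike np.where, combine_first aligns the two objects by label, so it also works when the indexes overlap only in part.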
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan], 'b': [np.nan, 2., np.nan, 6.], 'c': range(2,
18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.], 'b': [np.nan, 3., 4., 6., 8.]})
print(df1)
print(df2)
c = df1.combine_first(df2)
print(c)
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
import pandas as pd
import numpy as np
Using the stack method on this data pivots the columns into the rows, producing a
Series:
r = df.stack()
print(r)
state number
AP one 0
two 1
three 2
UP one 3
two 4
three 5
dtype: int64
From a hierarchically indexed Series, you can rearrange the data back into a
DataFrame with unstack:
b = r.unstack()
print(b)
s = r.unstack(0)
print(s)
state AP UP
number
one 0 3
two 1 4
three 2 5
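The frame used for stacking is not defined in the text above. A sketch that rebuilds it from the printed output (the construction itself is an assumption) and repeats the stack and unstack calls:

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the output: rows are states, columns are numbers.
df = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['AP', 'UP'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

r = df.stack()          # columns pivot into the innermost index level
b = r.unstack()         # innermost level ('number') goes back to the columns
s = r.unstack(0)        # level 0 ('state') goes to the columns instead
t = r.unstack('state')  # a level can also be given by name

print(r)
print(b)
print(s)
```

By default unstack moves the innermost level; passing a level number or name selects a different one.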
Unstacking might introduce missing data if all of the values in the level aren’t found in
each of the subgroups:
import pandas as pd
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
print(df.unstack())
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
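The Series printed above can be rebuilt as in the sketch below. Using concat with keys is an assumption about how it was created; the name df is kept only to match the print call in the notes, although the object is a Series.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

# keys= adds an outer index level ('one', 'two') on top of the letter labels.
df = pd.concat([s1, s2], keys=['one', 'two'])

wide = df.unstack()
print(wide)

# Going back: dropna() makes the removal of the missing cells explicit,
# since how stack treats missing values has differed between pandas versions.
back = wide.stack().dropna()
print(back)
```

The group 'one' has no 'e' and the group 'two' has no 'a' or 'b', so unstack fills those three cells with NaN and the values become floats.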
Pivoting
A common way to store multiple time series in databases and CSV is in so-called long or
stacked format. Let’s load some example data and do a small amount of time series
wrangling and other data cleaning:
data = pd.read_csv('examples/macrodata.csv')
print(data.head())
print(ldata[:10])
This is the so-called long format for multiple time series, or other observational data with
two or more keys (here, our keys are date and item). Each row in the table represents a
single observation.
In some cases, the data may be more difficult to work with in this format; you might
prefer to have a DataFrame containing one column per distinct item value indexed by
timestamps in the date column. DataFrame’s pivot method performs exactly this
transformation:
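A minimal sketch of pivot on a small long-format table. The data here is made up and stands in for ldata, whose construction is not shown in these notes; the item names and values are hypothetical. Keyword arguments are used because recent pandas releases expect index, columns and values to be passed by name.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) observation.
ldata = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-31', '2020-03-31',
                            '2020-06-30', '2020-06-30']),
    'item': ['gdp', 'infl', 'gdp', 'infl'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Rows come from 'date', columns from 'item', cells from 'value'.
pivoted = ldata.pivot(index='date', columns='item', values='value')
print(pivoted)
```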
The first two values passed are the columns to be used respectively as the row and
column index, then finally an optional value column to fill the DataFrame. Suppose you
had two value columns that you wanted to reshape simultaneously:
ldata['value2'] = np.random.randn(len(ldata))
print(ldata[:10])
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
print(pivoted['value'][:5])
Note that pivot is equivalent to creating a hierarchical index using set_index followed
by a call to unstack:
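The equivalence can be checked on a small table. This is a sketch with made-up data standing in for ldata; the item names and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data with two value columns.
ldata = pd.DataFrame({
    'date': ['2020-03-31', '2020-03-31', '2020-06-30', '2020-06-30'],
    'item': ['gdp', 'infl', 'gdp', 'infl'],
    'value': [1.0, 2.0, 3.0, 4.0],
    'value2': [10.0, 20.0, 30.0, 40.0],
})

# No values= argument: every remaining column is reshaped, giving hierarchical columns.
pivoted = ldata.pivot(index='date', columns='item')

# The same reshape in two explicit steps.
unstacked = ldata.set_index(['date', 'item']).unstack('item')

print(pivoted['value'])
print(pivoted.equals(unstacked))
```

Selecting pivoted['value'] returns the sub-table for one value column, with one column per item.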
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df)
key A B C
0 foo 1 4 7
1 bar 2 5 8
2 baz 3 6 9
The 'key' column may be a group indicator, and the other columns are data values.
When using pandas.melt, we must indicate which columns (if any) are group indicators.
Let’s use 'key' as the only group indicator here:
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7
Since the result of pivot creates an index from the column used as the row labels, we may
want to use reset_index to move the data back into a column:
print(reshaped.reset_index())
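The melt and pivot calls for this example are not shown above. A sketch of the full round trip, assuming the variable names melted and reshaped:

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Wide to long: 'key' identifies the group, all other columns become rows.
melted = pd.melt(df, id_vars=['key'])
print(melted)

# Long back to wide: one column per original variable.
reshaped = melted.pivot(index='key', columns='variable', values='value')
print(reshaped)

# Move the 'key' index back into an ordinary column.
print(reshaped.reset_index())
```

melt produces the columns key, variable and value, with one row for every (key, variable) pair.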