Unit 4 DSE

Uploaded by

g.mahalakshmi
HIERARCHICAL INDEXING

Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels within
a single index.

We begin with the standard imports:

import pandas as pd

import numpy as np

A Multiply Indexed Series


The bad way
Suppose you would like to track data about states from two different years. Using the Pandas
tools we’ve already covered, you might be tempted to simply use Python tuples as keys:

In[2]: index = [('California', 2000), ('California', 2010),
                ('New York', 2000), ('New York', 2010),
                ('Texas', 2000), ('Texas', 2010)]
       populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
       pop = pd.Series(populations, index=index)
       pop

Out[2]: (California, 2000)    33871648
        (California, 2010)    37253956
        (New York, 2000)      18976457
        (New York, 2010)      19378102
        (Texas, 2000)         20851820
        (Texas, 2010)         25145561
        dtype: int64
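The tuple-keyed Series works, but selecting on the second element of each key is awkward. The following is a minimal sketch, not part of the original notes, that contrasts a hand-written filter over tuple keys with the same selection on a MultiIndex built by pd.MultiIndex.from_tuples. The variable names (pop_tuples, pop_multi, mask) are chosen here for illustration only.

```python
import pandas as pd

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Tuple keys: picking out the 2010 rows needs an explicit test on every key.
pop_tuples = pd.Series(populations, index=index)
mask = [key[1] == 2010 for key in pop_tuples.index]
pop_2010_slow = pop_tuples[mask]

# MultiIndex built from the same tuples: the second level can be sliced directly.
pop_multi = pd.Series(populations, index=pd.MultiIndex.from_tuples(index))
pop_2010 = pop_multi.loc[:, 2010]

print(pop_2010)
```

Both selections return the same three values; the MultiIndex version also drops the year level, so the result is indexed by state name alone.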

Methods of MultiIndex Creation


The most straightforward way to construct a multiply indexed Series or DataFrame is to simply
pass a list of two or more index arrays to the constructor.
For example:

In[12]: df = pd.DataFrame(np.random.rand(4, 2),
                          index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                          columns=['data1', 'data2'])
        df

Out[12]:
         data1     data2
a 1   0.554233  0.356072
  2   0.925244  0.219474
b 1   0.441759  0.610054
  2   0.171495  0.886688

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically
recognize this and use a MultiIndex by default:

In[13]: data = {('California', 2000): 33871648,
                ('California', 2010): 37253956,
                ('Texas', 2000): 20851820,
                ('Texas', 2010): 25145561,
                ('New York', 2000): 18976457,
                ('New York', 2010): 19378102}
        pd.Series(data)

Out[13]: California  2000    33871648
                     2010    37253956
         New York    2000    18976457
                     2010    19378102
         Texas       2000    20851820
                     2010    25145561
         dtype: int64
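Besides passing arrays or tuple-keyed dictionaries, a MultiIndex can also be built explicitly. The sketch below, added here for illustration, uses the pandas class-method constructors from_arrays, from_tuples and from_product to build the same two-level index in three ways; the variable names are arbitrary.

```python
import pandas as pd

# One array per level
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# One tuple per row
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# Cartesian product of the level values
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three constructors describe the same index
print(from_arrays.equals(from_tuples), from_tuples.equals(from_product))
```

Any of these objects can be passed as the index argument when creating a Series or DataFrame.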

Concatenation of NumPy Arrays


NumPy arrays can be concatenated with the np.concatenate function:

In[4]: x = [1, 2, 3]
       y = [4, 5, 6]
       z = [7, 8, 9]
       np.concatenate([x, y, z])

Out[4]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])


The first argument is a list or tuple of arrays to concatenate. Additionally, it takes an axis
keyword that allows you to specify the axis along which the result will be concatenated:

In[5]: x = [[1, 2],
            [3, 4]]
       np.concatenate([x, x], axis=1)

Out[5]: array([[1, 2, 1, 2],
               [3, 4, 3, 4]])
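To make the effect of the axis keyword concrete, here is a small sketch (an addition, with illustrative variable names) that concatenates the same 2x2 array along each axis and compares the resulting shapes.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# axis=0 (the default) stacks the arrays on top of each other: more rows
rows = np.concatenate([grid, grid], axis=0)

# axis=1 places the arrays side by side: more columns
cols = np.concatenate([grid, grid], axis=1)

print(rows.shape, cols.shape)
```

The arrays must agree in every dimension except the one being concatenated along.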

Simple Concatenation with pd.concat


Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a
number of additional options, as its signature shows:
# Signature in Pandas v0.18 (the join_axes argument was removed in pandas 1.0)
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False, copy=True)

pd.concat() can be used for a simple concatenation of Series or DataFrame objects,
just as np.concatenate() can be used for simple concatenations of arrays:

In[6]: ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
       ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
       pd.concat([ser1, ser2])

Out[6]: 1    A
        2    B
        3    C
        4    D
        5    E
        6    F
        dtype: object

It also works to concatenate higher-dimensional objects, such as DataFrames:

In[7]: df1 = make_df('AB', [1, 2])
       df2 = make_df('AB', [3, 4])
       print(df1); print(df2); print(pd.concat([df1, df2]))

df1             df2             pd.concat([df1, df2])
    A   B           A   B           A   B
1  A1  B1       3  A3  B3       1  A1  B1
2  A2  B2       4  A4  B4       2  A2  B2
                                3  A3  B3
                                4  A4  B4
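The make_df helper is used in these examples but is not defined anywhere in this unit. A minimal sketch of such a helper follows, assuming (from the outputs shown) that each cell holds the column letter followed by the row label; treat it as a reconstruction rather than the original definition.

```python
import pandas as pd

def make_df(cols, ind):
    """Build a small DataFrame whose cells are '<column><row label>' strings."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
stacked = pd.concat([df1, df2])
print(stacked)
```

With this definition, the calls in In[7] and In[8] produce tables of the form shown in the text.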

By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0). Like
np.concatenate, pd.concat allows specification of an axis along which concatenation will take
place. Consider the following example:

In[8]: df3 = make_df('AB', [0, 1])
       df4 = make_df('CD', [0, 1])
       print(df3); print(df4); print(pd.concat([df3, df4], axis='columns'))

df3             df4             pd.concat([df3, df4], axis='columns')
    A   B           C   D           A   B   C   D
0  A0  B0       0  C0  D0       0  A0  B0  C0  D0
1  A1  B1       1  C1  D1       1  A1  B1  C1  D1

We could have equivalently specified axis=1; here we've used the more intuitive axis='columns'.
(The abbreviation axis='col' used in older material is rejected by current pandas versions.)
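The keys option listed in the pd.concat signature ties concatenation back to hierarchical indexing. As a hedged illustration (the labels 'first' and 'second' are invented here), passing keys labels each input and produces a two-level MultiIndex on the result:

```python
import pandas as pd

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

# Each key becomes the outer index level for the rows of the matching input
combined = pd.concat([ser1, ser2], keys=['first', 'second'])
print(combined)

# The outer level can then be used to recover one of the inputs
print(combined.loc['first'])
```

This is handy when the source of each row needs to stay visible after concatenation.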

1. Concatenate (concat)

The concat function in pandas is used to combine two or more DataFrame objects along a
particular axis (either rows or columns). It’s more flexible than append and supports additional
operations such as handling missing data.

Syntax:

import pandas as pd

# Concatenating along rows (axis=0) or columns (axis=1)
result = pd.concat([df1, df2], axis=0)

Key Parameters:

• axis: Determines the axis to concatenate along. axis=0 (default) concatenates along rows,
  while axis=1 concatenates along columns.
• join: Specifies how to handle labels on the other axis (like a SQL join). Options are outer
  (default) and inner.
    o outer join: Includes all rows/columns from both datasets and fills in missing
      values with NaNs.
    o inner join: Only includes rows/columns with matching labels.
• ignore_index: If True, the original index labels are discarded and the result is
  numbered 0 to n-1.

Example:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate along rows
result = pd.concat([df1, df2], axis=0, ignore_index=True)
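The example above uses frames with identical columns, so the join option has no visible effect. The sketch below is an added illustration, with made-up frames named left and right whose columns only partly overlap, to show how join='outer' and join='inner' differ.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# outer (default): keep the union of the columns, filling gaps with NaN
outer = pd.concat([left, right], join='outer', ignore_index=True)

# inner: keep only the columns present in every input
inner = pd.concat([left, right], join='inner', ignore_index=True)

print(outer)
print(inner)
```

Because ignore_index=True is set, both results are numbered 0 to 3 instead of repeating the labels 0 and 1.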

2. Append
The append method adds one DataFrame to the end of another along rows. It is essentially a
shortcut for pd.concat with axis=0. DataFrame.append was deprecated in pandas 1.4 and removed
in pandas 2.0, so the calls in this section run only on older versions; on current versions,
use pd.concat instead.

Syntax:

# Appending df2 to df1 (pandas < 2.0 only)
result = df1.append(df2, ignore_index=True)

Key Parameters:

• ignore_index: If True, the original index labels are discarded and the result is
  numbered 0 to n-1.

Example:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Append df2 to df1 (pandas < 2.0 only)
result = df1.append(df2, ignore_index=True)
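Since DataFrame.append is not available in pandas 2.0 and later, the same operations can be written with pd.concat. This is a minimal sketch of that substitution, not the notes' own code; the single-row frame new_row is an illustrative addition.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Equivalent of df1.append(df2, ignore_index=True)
result = pd.concat([df1, df2], ignore_index=True)

# Appending a single row: wrap the row in a one-row DataFrame first
new_row = pd.DataFrame([{'A': 9, 'B': 10}])
extended = pd.concat([result, new_row], ignore_index=True)

print(extended)
```

When many rows arrive one at a time, collecting them in a list and calling pd.concat once is usually cheaper than concatenating inside a loop, because each call copies the data.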

1. Merge

The merge function is the most versatile way to combine two DataFrame objects in pandas. It
provides various options for specifying how to align the datasets based on common columns or
indexes.

Syntax:

import pandas as pd

# Merging on a specific column
result = pd.merge(df1, df2, on='key_column')

Key Parameters:

• on: Specifies the column or index level names to join on. If not specified, merge will use
  overlapping column names.
• how: Specifies the type of join to perform. Options are:
    o inner (default): Only includes rows with keys present in both datasets.
    o outer: Includes all rows from both datasets and fills in missing values with NaNs.
    o left: Includes all rows from the left dataset and matching rows from the right
      dataset.
    o right: Includes all rows from the right dataset and matching rows from the left
      dataset.
• left_on and right_on: Specify the column(s) from the left and right DataFrame to join on, if
  they differ.
• suffixes: Specifies suffixes to append to overlapping column names from the left and right
  DataFrame.

Example:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [24, 25, 26]})

# Merge based on the 'ID' column, with an inner join
result = pd.merge(df1, df2, on='ID', how='inner')
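To compare the how options side by side, the following sketch (an addition that reuses the two frames from the example) runs all four join types on the same data. IDs 3 and 4 each appear in only one frame, so they show where the join types differ.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [24, 25, 26]})

inner = pd.merge(df1, df2, on='ID', how='inner')   # IDs in both frames
outer = pd.merge(df1, df2, on='ID', how='outer')   # IDs in either frame
left = pd.merge(df1, df2, on='ID', how='left')     # every ID from df1
right = pd.merge(df1, df2, on='ID', how='right')   # every ID from df2

for name, frame in [('inner', inner), ('outer', outer),
                    ('left', left), ('right', right)]:
    print(name, list(frame['ID']))
```

In the left join, Charlie (ID 3) has no Age and gets NaN; in the right join, ID 4 has no Name.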

2. Join

The join method is a simplified way to merge two DataFrame objects based on their indexes.
It is convenient for combining datasets where one or both of the datasets use an index as the key
for alignment.

Syntax:

# Joining two DataFrames on their indexes
result = df1.join(df2)

Key Parameters:

• how: Specifies the type of join (left, right, outer, or inner), similar to merge. Unlike
  merge, the default for join is left.
• on: Specifies a column of the calling DataFrame to match against the index of the other
  DataFrame, useful when the key is stored in a column rather than in the index.
• lsuffix and rsuffix: Specify suffixes to append to overlapping column names in the left
  and right DataFrame.

Example:

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
df2 = pd.DataFrame({'Age': [24, 25, 26]}, index=[1, 2, 4])

# Join based on the index with an outer join
result = df1.join(df2, how='outer')
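The on and lsuffix/rsuffix parameters are easier to follow with a concrete case. The sketch below is illustrative only: the frames orders, scores_a and scores_b and their columns are invented here and do not come from the notes.

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
ages = pd.DataFrame({'Age': [24, 25, 26]}, index=[1, 2, 4])

# Default how='left': keep every row of the calling DataFrame
left_join = names.join(ages)

# on=: match the 'person_id' column of the caller against the index of `ages`
orders = pd.DataFrame({'person_id': [1, 1, 4], 'item': ['pen', 'ink', 'cap']})
with_age = orders.join(ages, on='person_id')

# lsuffix/rsuffix: needed when both frames have a column with the same name
scores_a = pd.DataFrame({'score': [1, 2]}, index=['x', 'y'])
scores_b = pd.DataFrame({'score': [3, 4]}, index=['x', 'y'])
both = scores_a.join(scores_b, lsuffix='_a', rsuffix='_b')

print(left_join)
print(with_age)
print(both)
```

Without the suffixes, the last join would raise an error because the two 'score' columns collide.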

Hierarchical Indexes
Hierarchical indexing, also known as multi-indexing, means setting more than one column
as the index. In this article, we are going to use the homelessness.csv file.
# importing the pandas library under the alias pd
import pandas as pd

# calling the pandas read_csv() function
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())

Hierarchical Indexing in pandas:

# using the pandas set_index() function.
df_ind3 = df.set_index(['region', 'state', 'individuals'])

# sort_index() returns a new, sorted DataFrame, so the result must be assigned
df_ind3 = df_ind3.sort_index()
print(df_ind3.head(10))

Selecting Data in a Hierarchical Index or using the Hierarchical Indexing:


To select data from the DataFrame with the .loc indexer, pass the index labels in a list
inside square brackets.
# selecting the 'Pacific' and 'Mountain'
# regions from the dataframe.
# selecting data using the level-0 (outermost) index.
df_ind3_region = df_ind3.loc[['Pacific', 'Mountain']]
print(df_ind3_region.head(10))
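Since homelessness.csv is not included with these notes, the selection steps can be tried on a small stand-in. In the sketch below the rows are placeholder values invented for illustration, not figures from the real file. It also uses a two-level index (region, state) rather than the three levels used above, so that individuals remains a data column.

```python
import pandas as pd

# Placeholder rows: invented values, NOT taken from homelessness.csv
toy = pd.DataFrame({
    'region': ['Pacific', 'Pacific', 'Mountain', 'Mountain'],
    'state': ['Oregon', 'Alaska', 'Utah', 'Idaho'],
    'individuals': [10, 20, 30, 40],
})
toy_ind = toy.set_index(['region', 'state']).sort_index()

# Outer-level selection: a list of level-0 labels
two_regions = toy_ind.loc[['Mountain', 'Pacific']]

# Inner-level selection: a tuple with one label per level
utah = toy_ind.loc[('Mountain', 'Utah')]

# Several specific rows: a list of tuples
some = toy_ind.loc[[('Mountain', 'Idaho'), ('Pacific', 'Alaska')]]

print(two_regions)
print(utah)
print(some)
```

A list of plain labels selects on the outermost level only, while tuples address one label per level.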
