Unit 4 DSE
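Hierarchical indexing (also known as multi-indexing) incorporates multiple index levels within
a single index, so that higher-dimensional data can be represented compactly in a
one-dimensional Series or a two-dimensional DataFrame.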
import pandas as pd
import numpy as np
In[2]: index = [('California', 2000), ('California', 2010),
                ('New York', 2000), ('New York', 2010),
                ('Texas', 2000), ('Texas', 2010)]
pop
dtype: int64
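To make the idea concrete, here is a minimal sketch (not the original listing) that builds the two-level index explicitly with pd.MultiIndex.from_tuples and then selects by level. The level names state and year and the population figures are assumptions chosen for illustration.

```python
import pandas as pd

# State/year tuples as in the text above.
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

# Illustrative population figures (assumed values, one per tuple).
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# Build a two-level MultiIndex; the level names are hypothetical labels.
multi = pd.MultiIndex.from_tuples(index, names=['state', 'year'])
pop = pd.Series(populations, index=multi)

print(pop)

# Select every state for one year (slice on the first level, label on the second).
pop_2010 = pop.loc[:, 2010]
print(pop_2010)

# Select every year for one state.
print(pop.loc['California'])
```

Selecting on the second level with pop.loc[:, 2010] is what a plain tuple-based index cannot do directly, which is the main reason to prefer a MultiIndex here.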
In[12]: df = pd.DataFrame(np.random.rand(4, 2),
                          index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                          columns=['data1', 'data2'])
        df
Out[12]:
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize
this and use a MultiIndex by default:
pd.Series(data)
2010 37253956
2010 19378102
dtype: int64
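The dictionary route can be sketched as follows. This is an illustrative example, not the original listing: the variable name pop_from_dict is hypothetical, and the two 2000 figures are assumed values added to round out the 2010 figures shown above.

```python
import pandas as pd

# Keys are (state, year) tuples; pandas turns them into a two-level MultiIndex.
data = {('California', 2000): 33871648,   # assumed value for illustration
        ('California', 2010): 37253956,
        ('New York', 2000): 18976457,     # assumed value for illustration
        ('New York', 2010): 19378102}

pop_from_dict = pd.Series(data)

print(pop_from_dict)
print(type(pop_from_dict.index).__name__)
```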
In[4]: x = [1, 2, 3]
       y = [4, 5, 6]
       z = [7, 8, 9]
       np.concatenate([x, y, z])
Out[4]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
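As a small sketch of how the axis argument of np.concatenate changes the result for two-dimensional input, compare the shapes below. The array grid and the result names are hypothetical, chosen only for this illustration.

```python
import numpy as np

# A 2 x 3 array used only for illustration.
grid = np.arange(6).reshape(2, 3)

# axis=0 stacks the arrays on top of each other (more rows).
stacked_rows = np.concatenate([grid, grid], axis=0)

# axis=1 places the arrays side by side (more columns).
stacked_cols = np.concatenate([grid, grid], axis=1)

print(stacked_rows.shape)
print(stacked_cols.shape)
```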
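In[6]: ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
       ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
       pd.concat([ser1, ser2])
Out[6]: 1    A
        2    B
        3    C
        4    D
        5    E
        6    F
        dtype: object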
df1           df2           pd.concat([df1, df2])
    A   B         A   B         A   B
1  A1  B1     3  A3  B3     1  A1  B1
2  A2  B2     4  A4  B4     2  A2  B2
                            3  A3  B3
                            4  A4  B4
By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0). Like
np.concatenate, pd.concat allows specification of an axis along which concatenation will take
place. Consider the following example:
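df3           df4           pd.concat([df3, df4], axis='columns')
    A   B         C   D         A   B   C   D
0  A0  B0     0  C0  D0     0  A0  B0  C0  D0
1  A1  B1     1  C1  D1     1  A1  B1  C1  D1
We could have equivalently specified axis=1; here we've used the more readable
axis='columns'. (The abbreviation axis='col' that appears in some older examples is rejected
by current pandas versions.)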
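The column-wise case can be reproduced with a short sketch. The frames df3 and df4 are rebuilt here from the values in the table above; by_name and by_number are hypothetical variable names.

```python
import pandas as pd

# Two frames sharing the same row index (0, 1) but with different columns.
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})

# The axis can be given by name or by number; both place the frames side by side.
by_name = pd.concat([df3, df4], axis='columns')
by_number = pd.concat([df3, df4], axis=1)

print(by_name)
```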
1. Concatenate (concat)
The concat function in pandas combines two or more Series or DataFrame objects along a
particular axis (either rows or columns). It is more flexible than append: it accepts any number
of objects and lets you control how labels that do not match across the objects are handled.
Syntax:
import pandas as pd
pd.concat(objs, axis=0, join='outer', ignore_index=False)
Key Parameters:
axis: Determines the axis to concatenate along. axis=0 (default) concatenates along rows,
while axis=1 concatenates along columns.
join: Specifies how to handle labels on the other axis (like a SQL join). Options are outer
(default) and inner.
    o outer join: Keeps the union of the labels from all objects and fills in missing
      values with NaN.
    o inner join: Keeps only the labels that are present in every object.
ignore_index: If True, the original index labels along the concatenation axis are discarded
and the result is renumbered 0, 1, ..., n-1.
Example:
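A minimal sketch of the parameters described above, using two small frames that share only column B. All names and values here are hypothetical.

```python
import pandas as pd

# df1 has columns A and B; df2 has columns B and C (only B is shared).
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4']}, index=[3, 4])

# Default join='outer': union of the columns, NaN where a frame has no value.
outer = pd.concat([df1, df2])

# join='inner': only the columns present in both frames.
inner = pd.concat([df1, df2], join='inner')

# ignore_index=True: discard the labels 1..4 and renumber from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(outer)
print(inner)
print(renumbered)
```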
2. Append
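The append method adds one DataFrame to the end of another along rows. It is essentially a
shortcut for pd.concat with axis=0. Note that DataFrame.append was deprecated in pandas 1.4
and removed in pandas 2.0, so code written for current versions should call pd.concat instead.
Syntax:
df.append(other, ignore_index=False)
Key Parameters:
other: The DataFrame (or Series) whose rows are added to the end of df.
ignore_index: If True, the result is renumbered 0, 1, ..., n-1 instead of keeping the original
index labels.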
Example:
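Because append is not available in pandas 2.0 and later, the sketch below shows the same row-wise operation written with pd.concat, which behaves the same way across versions. The frame names and values are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']})
new_rows = pd.DataFrame({'A': ['A3'], 'B': ['B3']})

# Row-wise "append": concatenate along axis=0 and renumber the index.
appended = pd.concat([df1, new_rows], ignore_index=True)

print(appended)
```

Like the old append method, pd.concat returns a new DataFrame and leaves df1 unchanged.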
1. Merge
The merge function is the most versatile way to combine two DataFrame objects in pandas. It
provides various options for specifying how to align the datasets based on common columns or
indexes.
Syntax:
import pandas as pd
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, suffixes=('_x', '_y'))
Key Parameters:
on: Specifies the column or index level names to join on; they must exist in both DataFrames.
If no key is specified, merge uses the column names the two DataFrames have in common.
how: Specifies the type of join to perform. Options are:
    o inner (default): Only includes rows with keys present in both datasets.
    o outer: Includes all rows from both datasets and fills in missing values with NaN.
    o left: Includes all rows from the left dataset and only the matching rows from the
      right dataset.
    o right: Includes all rows from the right dataset and only the matching rows from the
      left dataset.
left_on and right_on: Specify the column(s) to join on in the left and right DataFrame
respectively, for use when the key columns have different names.
suffixes: Specifies the suffixes appended to overlapping non-key column names from the left
and right DataFrame (default '_x' and '_y').
Example:
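A minimal sketch of the merge options described above. The frames, the key labels K0 to K3, and the column names are hypothetical and chosen only to show which rows each join type keeps.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'value': [20, 30, 40]})

# Inner join (default): only keys found in both frames (K1, K2).
inner = pd.merge(left, right, on='key')

# Outer join: every key from either frame, NaN where one side has no match.
outer = pd.merge(left, right, on='key', how='outer')

# Left join with custom suffixes for the overlapping 'value' column.
left_join = pd.merge(left, right, on='key', how='left',
                     suffixes=('_left', '_right'))

# Key columns with different names: use left_on / right_on.
right_renamed = right.rename(columns={'key': 'id'})
by_diff_names = pd.merge(left, right_renamed, left_on='key', right_on='id')

print(inner)
print(outer)
print(left_join)
print(by_diff_names)
```

Note that with left_on and right_on both key columns (key and id) are kept in the result, whereas on='key' produces a single key column.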
2. Join
The join method is a simplified way to merge two DataFrame objects based on their indexes.
It is convenient for combining datasets where one or both of the datasets use an index as the key
for alignment.
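Syntax:
left.join(right, on=None, how='left', lsuffix='', rsuffix='')
Key Parameters:
how: Specifies the type of join (left, right, outer, or inner), as in merge, except that
the default for join is left.
on: Specifies a column of the calling DataFrame to match against the index of the other
DataFrame, useful when the key is stored in a column rather than in the index.
lsuffix and rsuffix: Specify the suffixes appended to overlapping column names in the left
and right DataFrame. Unlike merge, join has no default suffixes, so overlapping column names
raise an error unless a suffix is given.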
Example:
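A minimal sketch of join under the assumptions that the key lives in the index of both frames (first two calls) or in a column of the calling frame (last call). All frame names, labels, and values are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2, 3]}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'value': [20, 30]}, index=['K1', 'K2'])

# Index-on-index join; how='left' is the default, so K0 is kept with NaN.
# Both frames have a 'value' column, so suffixes are required.
joined = left.join(right, lsuffix='_left', rsuffix='_right')

# Inner join keeps only the index labels present in both frames.
inner = left.join(right, how='inner', lsuffix='_left', rsuffix='_right')

# Column-on-index join: match the 'key' column of orders against right's index.
orders = pd.DataFrame({'key': ['K1', 'K2', 'K1'], 'qty': [5, 6, 7]})
with_on = orders.join(right, on='key')

print(joined)
print(inner)
print(with_on)
```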
Hierarchical Indexes
A hierarchical index (also known as a multi-index) is created by setting more than one column
as the index. The examples in this section use the homelessness.csv file.
# importing pandas library as alias pd
import pandas as pd
df = pd.read_csv('homelessness.csv')
print(df.head())
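# df_ind3 is df with several columns set as the index via set_index();
# the column names are not given here
# sort_index() returns a new DataFrame, so assign the result back
df_ind3 = df_ind3.sort_index()
print(df_ind3.head(10))
# df_ind3_region is a selection from df_ind3 on the outer index level
print(df_ind3_region.head(10))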
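Since the listing above depends on a CSV file and on definitions that are not shown, the following self-contained sketch illustrates the same steps on a small in-memory frame. The column names (region, state, individuals, family_members) and all values are assumptions made for this example, not the contents of homelessness.csv.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data.
df = pd.DataFrame({
    'region': ['Pacific', 'Mountain', 'Pacific', 'Mountain'],
    'state': ['Oregon', 'Utah', 'Alaska', 'Idaho'],
    'individuals': [100, 200, 300, 400],
    'family_members': [10, 20, 30, 40],
})

# Set two columns as the index to obtain a hierarchical (multi-level) index.
df_ind = df.set_index(['region', 'state'])

# sort_index() returns a sorted copy, so the result is assigned back.
df_ind = df_ind.sort_index()
print(df_ind)

# Select all rows for one outer-level label.
pacific = df_ind.loc['Pacific']
print(pacific)

# Select a single row by giving a label for each level as a tuple.
one_row = df_ind.loc[('Mountain', 'Utah')]
print(one_row)
```

Sorting the index first is a deliberate choice: label-based slicing on a MultiIndex is more predictable, and generally faster, when the index is sorted.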