07 Data Wrangling
Reference
• Wes McKinney, Python for Data Analysis: Data
Wrangling with Pandas, NumPy, and IPython,
O’Reilly Media, 2nd Edition, 2018.
• Chapter 8
• Material: https://github.com/wesm/pydata-book
Introduction
Data Wrangling
• Data wrangling is the process of
gathering, selecting, and
transforming data to answer an
analytical question.
• In many applications, data may
be spread across a number of
files or databases or be arranged
in a form that is not easy to
analyze.
• This topic focuses on tools to
help combine, join, and
rearrange data.
Hierarchical Indexing
Introduction
• Hierarchical indexing means having multiple (two or more)
index levels on an axis.
Multi-Level Indexing in pandas Series
• We can create a Series
with a MultiIndex by
passing a list of arrays as
the index, one per level.
• With a hierarchically indexed Series, we can inspect the index,
select elements by partial indexing, and slice by label.
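A minimal sketch of these operations. The Series contents and the variable names are illustrative assumptions, not taken from the slides:

```python
import numpy as np
import pandas as pd

# Passing a list of two arrays as the index builds a two-level MultiIndex.
data = pd.Series(
    np.arange(9.0),
    index=[
        ["a", "a", "a", "b", "b", "c", "c", "d", "d"],
        [1, 2, 3, 1, 3, 1, 2, 2, 3],
    ],
)

index = data.index            # the MultiIndex itself
outer_b = data["b"]           # partial indexing on the outer level
b_to_c = data.loc["b":"c"]    # label slice on the outer level (inclusive)
inner_2 = data.loc[:, 2]      # selection on the inner level
```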
• We can use .unstack() to
reshape a Series with a
MultiIndex into a DataFrame.
• It moves the innermost level of the
row index to the columns, i.e.,
converts rows into columns.
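As a small illustration (the data and names are made up for this sketch), unstacking a two-level Series yields a DataFrame whose columns come from the inner level:

```python
import pandas as pd

# Two-level index: outer level "a"/"b", inner level "x"/"y".
s = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_product([["a", "b"], ["x", "y"]]),
)

# The inner level ("x", "y") moves from the rows to the columns.
wide = s.unstack()
print(wide)
```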
Multi-Level Indexing in pandas DataFrame
• With a DataFrame, both axes can have a hierarchical index.
• The hierarchical levels can have names (as strings or any
Python objects).
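The sketch below builds a DataFrame with two levels on each axis and names them. The labels ("key1", "state", and so on) are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Name the levels on both axes.
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Partial column indexing selects every column under "Ohio".
ohio = frame["Ohio"]
print(ohio)
```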
Reordering and Sorting Levels
• You can rearrange the order of the levels on an axis.
• swaplevel takes two level numbers or names and returns a
new object with the levels interchanged; the data itself is unchanged.
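A short sketch of swaplevel on a hypothetical two-level frame (the level names "key1" and "key2" are assumptions):

```python
import pandas as pd

frame = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
)

# Levels can be given by name or by position, e.g. swaplevel(0, 1).
swapped = frame.swaplevel("key1", "key2")
print(swapped)
```

The rows stay in their original order; only the order of the index levels changes.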
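• You can sort the data by the
values in one specific level.
• sort_index(level=...) sorts by the
values in the chosen level; by
default (sort_remaining=True) ties
are then ordered by the other levels.
• Selecting data from a hierarchically
indexed object is faster when the
index is lexicographically sorted,
starting with the outermost level.
• Try:
• df.sort_index(level=1, axis=1)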
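A minimal sketch of sorting by level on both axes, assuming an illustrative frame with two levels on the rows and on the columns:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Swap the row levels, then sort the rows by the new outer level.
by_number = frame.swaplevel(0, 1).sort_index(level=0)

# Sort the columns by their second level (the colors).
by_color = frame.sort_index(level=1, axis=1)
```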
Summary Statistics by Level
• You can compute descriptive and summary statistics on hierarchically
indexed DataFrames.
• Use groupby(level=...) to apply the operation to a particular index level.
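For instance, summing over one index level could look like the sketch below (the frame and the level names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=["x", "y", "z"],
)

# Rows that share the same "key2" value are summed together.
by_key2 = frame.groupby(level="key2").sum()
print(by_key2)
```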
Indexing with a DataFrame’s columns
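• You can convert one or more DataFrame columns into the index with
set_index.
• reset_index does the opposite: it moves the index levels back into
columns. If the columns you need are currently part of the index, call
reset_index first.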
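A small sketch of the round trip between columns and index, using made-up column names:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": range(4), "b": ["one", "one", "two", "two"], "c": [0, 1, 0, 1]}
)

# Columns "b" and "c" become a two-level row index.
indexed = df.set_index(["b", "c"])

# reset_index moves the index levels back into ordinary columns.
restored = indexed.reset_index()
```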
Combining and Merging Datasets
• In real-world data science, information is often
spread across multiple tables.
• Merging and joining help combine data for analysis.
• Join operations are central to relational databases
(e.g., SQL-based).
• Reference: https://www.w3schools.com/sql/sql_join.asp
• Merge or join operations combine datasets by linking
rows using one or more keys.
Join Types: Inner Join
• Returns only the rows that
have matching values in both
DataFrames based on the join
key.
Join Types: Left Join (Left Merge)
• Returns all rows from the left
DataFrame.
• If there is a match in the right
DataFrame, the corresponding
values are included.
• Unmatched rows from the right
DataFrame are discarded.
• For left rows with no match, the
columns coming from the right
DataFrame are filled with NaN.
Join Types: Right Join (Right Merge)
• Returns all rows from the right
DataFrame.
• If there is a match in the left
DataFrame, the corresponding
values are included.
• Unmatched rows from the left
DataFrame are discarded.
• For right rows with no match, the
columns coming from the left
DataFrame are filled with NaN.
Join Types: Outer Join (Full Merge)
• Returns all rows from both
DataFrames.
• If a row has no match in the other
DataFrame, its missing values are
filled with NaN.
• No rows are discarded, but missing
values appear where matches don’t
exist.
Merge and Join in pandas
• In pandas: merge() and join()
Merge()
• The “how” parameter
determines the merge
type; the default is inner
join
• The “on” parameter
specifies the key column(s)
for the merge.
• If it is not specified, merge
uses all column names
common to the two
DataFrames as the keys.
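A minimal sketch comparing an explicit key with the default behavior. The two frames are illustrative and share only the column "key":

```python
import pandas as pd

df1 = pd.DataFrame(
    {"key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)}
)
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# Explicit key column.
explicit = pd.merge(df1, df2, on="key")

# No "on": the only common column, "key", is used as the merge key.
implicit = pd.merge(df1, df2)
```

Naming the key explicitly is usually the safer habit, since an unexpected shared column name would otherwise silently become part of the key.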
• If the key columns have different names in each object, you can
specify them separately with left_on and right_on.
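A short sketch with differently named key columns ("lkey" and "rkey" are hypothetical names):

```python
import pandas as pd

df3 = pd.DataFrame({"lkey": ["b", "a", "c"], "data1": [0, 1, 2]})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"], "data2": [10, 20, 30]})

# Match df3["lkey"] against df4["rkey"]; both key columns are kept.
merged = pd.merge(df3, df4, left_on="lkey", right_on="rkey")
print(merged)
```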
• In addition to the default
'inner' join, you can
specify:
• 'left'
• 'right'
• 'outer'
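The four join types can be compared side by side with a sketch like this one, where the frames and the results dictionary are illustrative assumptions:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [10, 20, 30]})

# Run the same merge with each value of "how" and keep the results.
results = {
    how: pd.merge(left, right, on="key", how=how)
    for how in ["inner", "left", "right", "outer"]
}

for how, merged in results.items():
    print(how, list(merged["key"]))
```

Key "a" exists only on the left and "d" only on the right, so each join type keeps a different set of rows.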
• Many-to-many joins form the Cartesian product of the
matching rows.
• Each row from the first table is paired with every row from
the second table that has the same key value.
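To see the row counts multiply, consider this sketch with made-up frames in which key "a" appears twice on the left and three times on the right:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "a", "b"], "data1": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["a", "a", "a", "b"], "data2": [10, 20, 30, 40]})

merged = pd.merge(df1, df2, on="key")

# "a": 2 left rows x 3 right rows = 6 rows; "b": 1 x 1 = 1 row.
print(len(merged))
```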
• To merge two DataFrames using multiple keys (i.e.,
multiple columns), you can pass a list of column names
to the on parameter in the merge() function.
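A minimal sketch of a two-key merge (the frames are illustrative). Rows match only when both key columns agree:

```python
import pandas as pd

left = pd.DataFrame(
    {"key1": ["foo", "foo", "bar"], "key2": ["one", "two", "one"], "lval": [1, 2, 3]}
)
right = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar", "bar"],
        "key2": ["one", "one", "one", "two"],
        "rval": [4, 5, 6, 7],
    }
)

# An outer join keeps key combinations found on either side.
merged = pd.merge(left, right, on=["key1", "key2"], how="outer")
print(merged)
```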
• When merging DataFrames in pandas, if there are columns with
the same name that are not part of the key, pandas automatically
adds suffixes (_x, _y) to differentiate them.
• You can also use the suffixes option for specifying strings to
append to overlapping names in the left and right DataFrame
objects.
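The default and custom suffixes can be compared with a sketch like the following, in which "key2" is a non-key column present in both illustrative frames:

```python
import pandas as pd

left = pd.DataFrame(
    {"key1": ["foo", "foo", "bar"], "key2": ["one", "two", "one"], "lval": [1, 2, 3]}
)
right = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar", "bar"],
        "key2": ["one", "one", "one", "two"],
        "rval": [4, 5, 6, 7],
    }
)

# Default: overlapping non-key columns get "_x" and "_y".
default = pd.merge(left, right, on="key1")

# Custom suffixes for the left and right frames.
custom = pd.merge(left, right, on="key1", suffixes=("_left", "_right"))
```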
Merging on Index
• The merge key(s) can be in the index.
• You can pass left_index=True or right_index=True (or both).
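A short sketch in which the key is a column on the left and the index on the right (frame and column names are assumptions for illustration):

```python
import pandas as pd

left1 = pd.DataFrame(
    {"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)}
)
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Match the "key" column of left1 against the index of right1.
merged = pd.merge(left1, right1, left_on="key", right_index=True)
print(merged)
```

With the default inner join, the row with key "c" is dropped because "c" is not in the index of right1.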
Merge function arguments
Merging on Index
• For DataFrames with the same or similar indexes, you can also use the
join method, which joins on the index (left join by default).
• You can also join multiple DataFrames at once by passing a list.
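A minimal sketch of join, assuming three illustrative frames with overlapping indexes and distinct column names:

```python
import pandas as pd

left2 = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=["a", "c", "e"],
    columns=["Ohio", "Nevada"],
)
right2 = pd.DataFrame(
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [13.0, 14.0]],
    index=["b", "c", "d", "e"],
    columns=["Missouri", "Alabama"],
)
another = pd.DataFrame(
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [16.0, 17.0]],
    index=["a", "c", "e", "f"],
    columns=["New York", "Oregon"],
)

# Join on the index; "how" works as in merge.
joined = left2.join(right2, how="outer")

# Passing a list joins several frames at once (left join by default).
multi = left2.join([right2, another])
```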
Concatenating Series
• Concatenation (also called binding or stacking) is done with the concat
function.
• You can glue multiple Series together by passing them as a list.
• Use the keys argument to identify the origin of each entry; it adds an outer index level.
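A small sketch with three made-up Series whose indexes do not overlap:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])
s3 = pd.Series([5, 6], index=["f", "g"])

# Glue the Series end to end.
stacked = pd.concat([s1, s2, s3])

# keys adds an outer index level recording where each entry came from.
labelled = pd.concat([s1, s2, s3], keys=["one", "two", "three"])
print(labelled)
```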
Concatenating DataFrames Along an Axis
• concat also works on DataFrames.
• When the row index has no relevant data, pass
ignore_index=True.
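The effect of ignore_index can be seen in this sketch, where both illustrative frames carry a default integer index:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=["b", "a"])

# Row labels are kept as they are, so 0 and 1 appear twice.
kept = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([df1, df2], ignore_index=True)
```

Columns are aligned by name, so the different column order in df2 does not matter.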
Combining Data with Overlap
• When two objects overlap and you want to patch the NA values
of the calling object with values from the other, use combine_first.
• The combine_first() method fills missing values
(NaN) in one DataFrame with values from another
DataFrame, aligning on index and columns. It works like a prioritized merge.
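A minimal sketch of patching, with made-up values. df1 has priority, and df2 only fills the gaps and supplies the extra row:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"a": [1.0, np.nan, 5.0, np.nan], "b": [np.nan, 2.0, np.nan, 6.0]}
)
df2 = pd.DataFrame(
    {"a": [5.0, 4.0, np.nan, 3.0, 7.0], "b": [np.nan, 3.0, 4.0, 6.0, 8.0]}
)

# Values from df1 win; NaN entries are filled from df2 where possible.
patched = df1.combine_first(df2)
print(patched)
```

A cell stays NaN only when it is missing in both frames.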
Reshaping and Pivoting
• Reshaping and pivoting are data transformation techniques used
in pandas to change the structure/layout of a DataFrame, making
it easier to analyze and visualize data.
• Changing from wide format to long format (or vice versa).
• Reorganizing rows and columns.
• Aggregating or splitting data.
Reshaping with Hierarchical Indexing
• You can (un)stack a different level by passing a level number or
name.
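The sketch below shows the level argument for unstack only; stack accepts a level in the same way. The level names "state" and "number" are illustrative assumptions:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["Ohio", "Colorado"], ["one", "two", "three"]], names=["state", "number"]
)
result = pd.Series(range(6), index=idx)

# Default: the innermost level ("number") moves to the columns.
by_number = result.unstack()

# A different level can be chosen by position or by name.
by_state = result.unstack(level=0)
by_state_named = result.unstack(level="state")
```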
Reshaping Data Between Wide and Long Formats
• The pivot() function
transforms rows into
columns, converting long-
format data into wide format.
Pivoting Long to Wide Format
• When you want to unstack based on
keys held in columns (not the index), use pivot.
• pivot() requires that each (index,
columns) pair appears only once;
duplicate pairs raise a ValueError.
• pivot() takes the following parameters:
• index: The column(s) whose unique
values will become the new row index.
• columns: The column whose unique
values will become the new column
headers.
• values: The column whose values will
populate the table.
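A minimal sketch of pivot and of the uniqueness requirement. The column names "date", "item", and "value" are assumptions for illustration:

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "date": ["2024-01", "2024-01", "2024-02", "2024-02"],
        "item": ["x", "y", "x", "y"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# One row per date, one column per item.
wide = long_df.pivot(index="date", columns="item", values="value")

# Repeating a (date, item) pair makes the reshape ambiguous.
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
raised = False
try:
    dup.pivot(index="date", columns="item", values="value")
except ValueError:
    raised = True
```

When duplicate pairs are expected, pivot_table with an aggregation function is the usual alternative.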
Pivoting Wide to Long Format
• When you want to stack based on
keys (not index), use melt.
• melt() takes the following
parameters:
• id_vars: Column(s) to keep unchanged
(identifier columns).
• value_vars: Column(s) to unpivot
(optional; if not provided, all remaining
columns are melted).
• var_name: Name of the new column
that stores former column names
(default is "variable").
• value_name: Name of the new column
that stores the corresponding values
(default is "value").
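A short sketch of melt using all four parameters, followed by a pivot back to wide format. The frame and the names "letter" and "score" are illustrative:

```python
import pandas as pd

wide = pd.DataFrame(
    {"key": ["foo", "bar", "baz"], "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
)

# Keep "key" as the identifier and unpivot only columns "A" and "B".
long_df = pd.melt(
    wide,
    id_vars="key",
    value_vars=["A", "B"],
    var_name="letter",
    value_name="score",
)

# pivot reverses the operation for the melted columns.
back = long_df.pivot(index="key", columns="letter", values="score")
```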
Exercises
Exercise 1
• Create a Series with a multi-level index where:
• First level (Country): ['USA', 'USA', 'Canada', 'Canada', 'Germany', 'Germany']
• Second level (Year): [2015, 2020, 2015, 2020, 2015, 2020]
• Values (GDP in Trillions): [18.2, 21.4, 1.6, 1.9, 3.4, 3.8]
• Name the index levels "Country" and "Year".
• Retrieve the GDP of Canada in 2020.
• Convert the Series to a DataFrame with a column named "GDP
(Trillions)".
• Reset the index so that "Country" and "Year" become regular columns.
• Swap the levels of the multi-index so that "Year" comes first and
"Country" comes second.
Exercise 2
Given the following DataFrames (df1 holds the students and df2 the
courses; the questions refer to them as df_students and df_courses),
answer the following questions:
df1 = pd.DataFrame({
    'Student_ID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [20, 21, 22, 23]
})
df2 = pd.DataFrame({
    'Student_ID': [101, 102, 105, 106],
    'Course': ['Math', 'Physics', 'Chemistry', 'Biology']
})
df_filled_scores = pd.DataFrame({
    'Student_ID': [201, 202, 203, 204],
    'Math_Score': [88, 75, None, 80],
    'Physics_Score': [82, None, 79, None]
})
• Merge df_students and df_courses on Student_ID using:
• Inner join (students only in both tables).
• Outer join (all students from both tables).
• Left join (keep all students from df_students).
• Combine df_students and df_extra.
• Fill missing values in df_scores using df_filled_scores.
Exercise 4
Create a multi-index DataFrame for a company's sales in multiple regions:
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Year': [2023, 2024, 2023, 2024, 2023, 2024],
    'Product': ['A', 'B', 'A', 'B', 'B', 'A'],
    'Sales': [100, 200, 150, 250, None, None]  # Missing Sales values for certain products
})
df_profit = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Year': [2023, 2024, 2023, 2024, 2023, 2024],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Profit': [20, 35, 25, None, 30, None]  # Missing Profit values for certain products
})
• Pivot the sales table so that:
• "Year" is the index.
• "Region" is in columns.
• "Sales" values fill the table.
• Melt the DataFrame back into long format.
• Merge this melted DataFrame with the profit DataFrame.
• Fill missing values in the merged DataFrame.