07 Data Wrangling

The document provides an overview of data wrangling techniques, focusing on hierarchical indexing, combining and merging datasets, and reshaping data using pandas. It covers various join types, including inner, left, right, and outer joins, as well as methods for concatenating and reshaping data. Additionally, it includes exercises to reinforce the concepts discussed.

Uploaded by

abdklaib233

Data Wrangling: Join, Combine, and Reshape

Prof. Gheith Abandah


Adapted by Prof. Iyad Jafar and Dr. Mohammad Abdel-Majeed

Developing Curricula for Artificial Intelligence and Robotics (DeCAIR)


618535-EPP-1-2020-1-JO-EPPKA2-CBHE-JP
1
Outline
• Introduction
• Hierarchical Indexing
• Combining and Merging Datasets
• Reshaping and Pivoting

2
Reference
• Wes McKinney, Python for Data Analysis: Data
Wrangling with Pandas, NumPy, and IPython,
O’Reilly Media, 2nd Edition, 2018.
• Chapter 8
• Material: https://github.com/wesm/pydata-book

3
Introduction

4
Data Wrangling
• Data wrangling is the process of
gathering, selecting, and
transforming data to answer an
analytical question.
• In many applications, data may
be spread across a number of
files or databases or be arranged
in a form that is not easy to
analyze.
• This topic focuses on tools to
help combine, join, and
rearrange data.

5
Hierarchical Indexing

6
Introduction
• Hierarchical indexing is having multiple
index levels on an axis.

• Why hierarchical indexing?
• Allows you to work with higher
dimensional data in a lower dimensional
form.
• Gives cleaner and more organized data
representation
• Allows selecting subsets of data easily
• Reduces redundancy and optimizes
lookups.
7
Multi-Level Indexing in pandas Series
• Hierarchical indexing
(MultiIndex) allows a
pandas Series to have
multiple levels of index
labels.

• We can create a Series
with a MultiIndex at
creation time.

8
Multi-Level Indexing in pandas Series
• Or we can create a Series
with a MultiIndex by
providing multiple index
levels.
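
The two ways of building a MultiIndex Series described above can be sketched as follows. The labels and values are illustrative only, not taken from the slides.

```python
import numpy as np
import pandas as pd

# Option 1: pass a list of arrays as the index at creation time.
s1 = pd.Series(
    np.arange(6),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

# Option 2: build the MultiIndex explicitly from its levels.
idx = pd.MultiIndex.from_product([["a", "b", "c"], [1, 2]])
s2 = pd.Series(np.arange(6), index=idx)

print(s1.index.nlevels)  # number of index levels
```

Both objects carry the same two-level index; from_product is convenient when every combination of the levels is present.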

9
Multi-Level Indexing in pandas Series
• With a hierarchically indexed Series, we can inspect the index,
select elements, and use slicing
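
A minimal sketch of these access patterns, assuming a small two-level Series with made-up labels:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(6),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

outer = s["b"]          # partial indexing: all rows under outer label "b"
both = s.loc[("b", 2)]  # full key: a single element
span = s.loc["a":"b"]   # label slice on the outer level (index is sorted)
inner = s.loc[:, 2]     # select by the inner level across all outer labels
```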

10
Multi-Level Indexing in pandas Series
• We can use .unstack() to reshape a Series with a MultiIndex
• Moves the innermost level of the row index to the columns, i.e.
converts rows into columns

• We can reverse the unstacking using .stack()
• Converts columns into rows
• The last level of the column index is moved into the row index
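
As a rough illustration of the round trip between unstack() and stack(), assuming an illustrative Series with no missing combinations:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(6),
    index=[["a", "a", "b", "b", "c", "c"], [1, 2, 1, 2, 1, 2]],
)

wide = s.unstack()   # inner level (1, 2) becomes the columns
back = wide.stack()  # columns move back into the inner row level
```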

11
Multi-Level Indexing in pandas DataFrame
• With a DataFrame, both axes can have a hierarchical index.
• The hierarchical levels can have names (as strings or any
Python objects).
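
A small sketch of a DataFrame with named hierarchical levels on both axes. The state and color labels follow the style of the reference book's example and are otherwise arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame.index.names = ["key1", "key2"]      # names for the row levels
frame.columns.names = ["state", "color"]  # names for the column levels

ohio = frame["Ohio"]  # partial column indexing selects a group of columns
```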

12
Multi-Level Indexing in pandas DataFrame

13
Reordering and Sorting Levels
• You can rearrange the order of the levels on an axis
• The swaplevel method takes two level numbers or names and returns a
new object with the levels interchanged (the data is otherwise unaltered).
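
For instance, with an illustrative frame whose row levels are named key1 and key2 (names chosen here for the example), swaplevel might be used like this:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(8).reshape((4, 2)),
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["key1", "key2"]),
    columns=["x", "y"],
)

swapped = frame.swaplevel("key1", "key2")  # returns a new object
```

The original frame keeps its level order; only the returned object has key2 as the outer level.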

14
Reordering and Sorting Levels
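• You can sort the data by the
values in one specific level.
• sort_index(level=...) sorts the
data using the values in a
single level.
• Data selection performs much
better on hierarchically indexed
objects when the index is sorted
lexicographically, starting with
the outermost level.
• Try
• df.sort_index(level=1, axis=1)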
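
A brief sketch of sorting by one level on either axis, assuming an illustrative frame with two row levels and two column levels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

by_inner_rows = frame.sort_index(level=1)          # sort rows by the second row level
by_inner_cols = frame.sort_index(level=1, axis=1)  # sort columns by the second column level
```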
15
Summary Statistics by Level
• You can apply descriptive and summary statistics to hierarchical
DataFrames.
• Use groupby(level=...) to apply the operation on a certain level.
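
A minimal sketch of a per-level aggregation, assuming row levels named key1 and key2 (illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["key1", "key2"]),
    columns=["x", "y", "z"],
)

# Sum the rows that share the same value in the "key2" level.
by_key2 = frame.groupby(level="key2").sum()
```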

16
Indexing with a DataFrame’s columns
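• You can convert one or more DataFrame columns to the index with
set_index
• set_index replaces the existing index by default, so call reset_index
first if the current index holds data you want to keep as a column.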
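
The relationship between set_index and reset_index can be sketched as below; the column names a, b, c, d are placeholders.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(["c", "d"])  # columns c and d become index levels
restored = indexed.reset_index()       # index levels move back into columns
```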

17
Combining and Merging Datasets

18
Combining and Merging Datasets
• In real-world data science, information is often
spread across multiple tables.
• Merging and joining help combine data for analysis.
• Join operations are central to relational databases
(e.g., SQL-based).
• Reference: https://www.w3schools.com/sql/sql_join.asp
• Merge or join operations combine datasets by linking
rows using one or more keys.
19
Join Types: Inner Join
• Returns only the rows that
have matching values in both
DataFrames based on the join
key.

• Unmatched rows from both
DataFrames are discarded.

20
Join Types: Left Join (Left Merge)
• Returns all rows from the left
DataFrame.
• If there is a match in the right
DataFrame, the corresponding
values are included.
• Unmatched rows from the right
DataFrame are discarded.
• For left rows with no match, the
columns that come from the right
DataFrame are filled with NaN.
21
Join Types: Right Join (Right Merge)
• Returns all rows from the right
DataFrame.
• If there is a match in the left
DataFrame, the corresponding
values are included.
• Unmatched rows from the left
DataFrame are discarded.
• For right rows with no match, the
columns that come from the left
DataFrame are filled with NaN.
22
Join Types: Outer Join (Full Merge)
• Returns all rows from both
DataFrames.
• If a row has no match in the other
DataFrame, its missing values are
filled with NaN.
• No rows are discarded, but missing
values appear where matches don’t
exist.
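
The four join types can be compared on a pair of small, made-up DataFrames. This is only a sketch; the frames and the column names key, lval, and rval are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Number of result rows for each join type.
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ("inner", "left", "right", "outer")
}

# In a left join, key "a" has no match, so its rval is NaN.
left_merged = pd.merge(left, right, on="key", how="left")
print(row_counts)
```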

23
Merge and Join in pandas
• Merge or join operations combine datasets by linking rows
using one or more keys.
• In pandas: merge() and join()


24
Merge()
• The “how” parameter
determines the merge
type; the default is inner
join

25
Merge()
• The “on” parameter
determines the key column
for merge.
• If on is not specified, merge
uses all column names common
to the two DataFrames as the
keys.
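
A short sketch of the on parameter, using illustrative frames that share a single column named key:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

implicit = pd.merge(df1, df2)            # key inferred from the shared column name
explicit = pd.merge(df1, df2, on="key")  # same merge with the key stated
```

Stating the key explicitly is generally the safer habit, since it does not change silently when a new shared column appears.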

26
Merge()
• If the column names are different in each object, you can
specify them separately.
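
When the key columns are named differently, left_on and right_on can be used. The names lkey and rkey below are placeholders.

```python
import pandas as pd

df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"], "data2": range(3)})

# Match df3["lkey"] against df4["rkey"]; both key columns appear in the result.
merged = pd.merge(df3, df4, left_on="lkey", right_on="rkey")
```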

27
Merge()
• In addition to the default
'inner' join, you can
specify:
• 'left'
• 'right'
• 'outer'

28
Merge()
• Many-to-many joins form the Cartesian product of the
matching rows.
• Each row from the first table is paired with all rows from
the second table that match the join condition
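
A small sketch of a many-to-many merge with invented data: key "b" appears twice on each side, so it yields 2 x 2 = 4 result rows.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a"], "data1": [0, 1, 2]})
df2 = pd.DataFrame({"key": ["a", "b", "b"], "data2": [0, 1, 2]})

merged = pd.merge(df1, df2, on="key")
b_rows = int((merged["key"] == "b").sum())  # rows produced by key "b"
```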

29
Merge()
• To merge two DataFrames using multiple keys (i.e.,
multiple columns), you can pass a list of column names
to the on parameter in the merge() function.
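
Merging on several key columns might look like the following sketch; the frames and column names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["foo", "foo", "bar"],
    "key2": ["one", "two", "one"],
    "lval": [1, 2, 3],
})
right = pd.DataFrame({
    "key1": ["foo", "foo", "bar", "bar"],
    "key2": ["one", "one", "one", "two"],
    "rval": [4, 5, 6, 7],
})

# Rows match only when both key1 and key2 agree.
merged = pd.merge(left, right, on=["key1", "key2"], how="outer")
```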

30
Merge()
• When merging DataFrames in pandas, if there are columns with
the same name that are not part of the key, pandas automatically
adds suffixes (_x, _y) to differentiate them.

31
Merge()
• You can also use the suffixes option for specifying strings to
append to overlapping names in the left and right DataFrame
objects.
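
To illustrate how overlapping non-key column names are handled, here is a sketch with made-up frames in which key2 exists on both sides but is not used as a key:

```python
import pandas as pd

left = pd.DataFrame({
    "key1": ["foo", "foo", "bar"],
    "key2": ["one", "two", "one"],
    "lval": [1, 2, 3],
})
right = pd.DataFrame({
    "key1": ["foo", "foo", "bar", "bar"],
    "key2": ["one", "one", "one", "two"],
    "rval": [4, 5, 6, 7],
})

default = pd.merge(left, right, on="key1")  # key2 becomes key2_x / key2_y
custom = pd.merge(left, right, on="key1", suffixes=("_left", "_right"))
```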

32
Merging on Index
• The merge key(s) can be in the index.
• You can pass left_index=True or right_index=True (or both).
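
A minimal sketch of merging a key column against the other frame's index, with illustrative data:

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)})
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Use the "key" column on the left and the index on the right as the merge key.
merged = pd.merge(left1, right1, left_on="key", right_index=True)
```

Because the default is an inner join, the row with key "c" has no match in right1 and is dropped.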

33
Merge function arguments

34
Merging on Index
• When the DataFrames have the same or similar indexes and no
overlapping column names, you can also use the join method.
• You can also join multiple DataFrames at once by passing a list.
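
The join method can be sketched as follows. The frames are illustrative, and the third frame, named another here, exists only to show joining a list.

```python
import pandas as pd

left2 = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=["a", "c", "e"], columns=["Ohio", "Nevada"],
)
right2 = pd.DataFrame(
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [13.0, 14.0]],
    index=["b", "c", "d", "e"], columns=["Missouri", "Alabama"],
)
another = pd.DataFrame(
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [16.0, 17.0]],
    index=["a", "c", "e", "f"], columns=["New York", "Oregon"],
)

outer = left2.join(right2, how="outer")  # index-on-index outer join
multi = left2.join([right2, another])    # left join against several frames
```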

35
Concatenating Series
• Concatenation (also called binding or stacking) is supported by the
concat function.
• You can glue multiple Series together.
• You can use keys to identify the origin of each entry.
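
A short sketch of concatenating Series, with and without keys. The values and the key labels one, two, three are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"])
s3 = pd.Series([5, 6], index=["f", "g"])

glued = pd.concat([s1, s2, s3])  # one longer Series
keyed = pd.concat([s1, s2, s3], keys=["one", "two", "three"])  # origin kept as outer level
```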

36
Concatenating DataFrames Along an Axis
• concat also works on DataFrames.
• When the row index has no relevant data, pass
ignore_index=True.
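
The effect of ignore_index can be seen in this sketch with two small illustrative frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=["a", "b"])

kept = pd.concat([df1, df2])                      # original row labels are kept
fresh = pd.concat([df1, df2], ignore_index=True)  # rows are renumbered from 0
```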

37
Combining Data with Overlap
• When two objects have overlapping indexes and you want to patch
the NA values of the caller object with values from the other, use
combine_first.
• The combine_first() method fills missing values (NaN) in one
DataFrame with values from another DataFrame. It works like a
prioritized merge.
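
A minimal sketch of combine_first with invented data: values from df1 take priority, and df2 only fills the gaps and the extra row.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan, 5.0, np.nan], "b": [np.nan, 2.0, np.nan, 6.0]})
df2 = pd.DataFrame({"a": [5.0, 4.0, np.nan, 3.0, 7.0], "b": [np.nan, 3.0, 4.0, 6.0, 8.0]})

result = df1.combine_first(df2)
```

A cell stays NaN only when it is missing in both frames.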

38
Reshaping and Pivoting

39
Reshaping and Pivoting
• Reshaping and pivoting are data transformation techniques used
in pandas to change the structure/layout of a DataFrame, making
it easier to analyze and visualize data.
• Changing from wide format to long format (or vice versa).
• Reorganizing rows and columns.
• Aggregating or splitting data.

• Pandas objects can be reshaped using:
• T (returns a transposed view)
• stack: “rotates” or pivots from the columns in the data to the rows.
• unstack: pivots from the rows into the columns.
40
Reshaping with Hierarchical Indexing
• By default the innermost level is (un)stacked.
• When you unstack a DataFrame, the unstacked level becomes the
innermost (lowest) column level of the result.

41
Reshaping with Hierarchical Indexing
• Unstacking might introduce missing data.
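
Missing data appears when not every combination of the index levels exists. A sketch, assuming two illustrative Series with partly different labels:

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data = pd.concat([s1, s2], keys=["one", "two"])

wide = data.unstack()  # combinations absent from data become NaN
n_missing = int(wide.isna().sum().sum())
```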

42
Reshaping with Hierarchical Indexing
• You can (un)stack a different level by passing a level number or
name.
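
Choosing the level by number or by name can be sketched like this; the level names state and number are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = frame.stack()  # Series indexed by (state, number)

by_number = result.unstack(level=0)      # level given by position
by_name = result.unstack(level="state")  # same level given by name
```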

43
Reshaping Data Between Wide and Long Formats
• The pivot() function
transforms rows into
columns, converting long-
format data into wide format.

• The melt() function
transforms wide-format data
into long format by
converting columns into rows.

44
Pivoting Long to Wide Format
• When you want to unstack based on
keys (not index), use pivot.
• pivot() requires that each (index,
columns) pair has only one unique
value.
• pivot() takes the following parameters:
• index: The column(s) whose unique
values will become the new row index.
• columns: The column whose unique
values will become the new column
headers.
• values: The column whose values will
populate the table.
45
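
A minimal pivot() sketch on invented long-format data, where each (date, item) pair occurs exactly once:

```python
import pandas as pd

long_df = pd.DataFrame({
    "date": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "item": ["x", "y", "x", "y"],
    "value": [1, 2, 3, 4],
})

# One row per date, one column per item, cells filled from "value".
wide = long_df.pivot(index="date", columns="item", values="value")
```

If a (date, item) pair were duplicated, pivot() would raise an error; pivot_table() with an aggregation function is the usual alternative in that case.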
Pivoting Wide to Long Format
• When you want to stack based on
keys (not index), use melt.
• melt() takes the following
parameters:
• id_vars: Column(s) to keep unchanged
(identifier columns).
• value_vars: Column(s) to unpivot
(optional; if not provided, all remaining
columns are melted).
• var_name: Name of the new column
that stores former column names
(default is "variable").
• value_name: Name of the new column
that stores the corresponding values
(default is "value").
46
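
A short melt() sketch on made-up wide data, followed by a pivot() back to the wide layout:

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "baz"], "A": [1, 2, 3], "B": [4, 5, 6]})

# Columns A and B are melted; "key" is kept as the identifier column.
melted = pd.melt(df, id_vars="key", var_name="variable", value_name="value")

# Pivoting on the same columns restores the wide layout (rows sorted by key).
reshaped = melted.pivot(index="key", columns="variable", values="value")
```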
Exercises

47
Exercise 1
• Create a Series with a multi-level index where:
• First level (Country): ['USA', 'USA', 'Canada', 'Canada', 'Germany', 'Germany']
• Second level (Year): [2015, 2020, 2015, 2020, 2015, 2020]
• Values (GDP in Trillions): [18.2, 21.4, 1.6, 1.9, 3.4, 3.8]
• Name the index levels "Country" and "Year".
• Retrieve the GDP of Canada in 2020.
• Convert the Series to a DataFrame with a column named "GDP
(Trillions)".
• Reset the index so that "Country" and "Year" become regular columns.
• Swap the levels of the multi-index so that "Year" comes first and
"Country" comes second.
48
Exercise 2
Given the following two DataFrames, complete the tasks below:
df1 = pd.DataFrame({
    'Student_ID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [20, 21, 22, 23]
})
df2 = pd.DataFrame({
    'Student_ID': [101, 102, 105, 106],
    'Course': ['Math', 'Physics', 'Chemistry', 'Biology']
})

• Merge df1 and df2 on 'Student_ID', keeping only students
that exist in both DataFrames.
• Merge again, but keep all students from both DataFrames.
• Merge again, but keep all students from df1 (left join).
49
Exercise 3
Given the following DataFrames, complete the tasks below:
df_students = pd.DataFrame({
    'Student_ID': [201, 202, 203, 204],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [20, 21, 22, 23]
})
df_courses = pd.DataFrame({
    'Student_ID': [201, 202, 205, 206],
    'Course': ['Math', 'Physics', 'Chemistry', 'Biology']
})
df_scores = pd.DataFrame({
    'Student_ID': [201, 202, 203, 204],
    'Math_Score': [85, None, 78, None],
    'Physics_Score': [None, 90, None, 88]
})
df_extra = pd.DataFrame({
    'Student_ID': [207, 208],
    'Name': ['Eve', 'Frank'],
    'Age': [24, 25]
})
df_filled_scores = pd.DataFrame({
    'Student_ID': [201, 202, 203, 204],
    'Math_Score': [88, 75, None, 80],
    'Physics_Score': [82, None, 79, None]
})

• Merge df_students and df_courses on Student_ID using:
• Inner join (students only in both tables).
• Outer join (all students from both tables).
• Left join (keep all students from df_students).
• Combine df_students and df_extra.
• Fill missing values in df_scores using df_filled_scores.
50
Exercise 4
Create a multi-index DataFrame for a company's sales in multiple regions:
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Year': [2023, 2024, 2023, 2024, 2023, 2024],
    'Product': ['A', 'B', 'A', 'B', 'B', 'A'],
    'Sales': [100, 200, 150, 250, None, None]  # Missing Sales values for certain products
})
df_profit = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Year': [2023, 2024, 2023, 2024, 2023, 2024],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Profit': [20, 35, 25, None, 30, None]  # Missing Profit values for certain products
})

• Pivot the sales table so that:
• "Year" is the index.
• "Region" is in columns.
• "Sales" values fill the table.
• Melt the DataFrame back into long format.
• Merge this melted DataFrame with the profit DataFrame.
• Fill missing values in the merged DataFrame.

51
