Exercises 2
Put all your exercises (Jupyter Notebooks or Python files) in your course Git project.
Use either code comments or Jupyter Notebook markdown (text) cells to document which
exercise you are doing and what a certain code section does!
You can return all of these exercises in a single Jupyter Notebook if you wish.
The datasets for these exercises have been collected from kaggle.com
(a service providing different datasets for practice).
1. Import pandas and read the CSV file found in Moodle (loans.csv). Use
Python with pandas to answer the questions (a sketch of the main steps
follows this list).
- Remove the Customer ID column from the data.
- Remove rows from the data that have too large a loan
  (Current Loan Amount should be less than 99999999).
  Tip: use filtering!
- Remove rows that have the Annual Income as NaN (not a number).
  o Extra task: use imputation to use the average income as the
    value instead of NaN.
- Get the average Current Loan Amount.
- Create a new field in your dataset called Actual Annual Income.
- Get the Loan ID of the loan with the smallest Actual Annual Income.
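A minimal sketch of these steps, assuming the column names (Customer ID,
Current Loan Amount, Annual Income, Monthly Debt, Loan ID) match loans.csv.
The Actual Annual Income line is only a placeholder; the exact formula
comes from the course material:

# example sketch for exercise 1
import pandas as pd

df = pd.read_csv('loans.csv')

# remove the Customer ID column
df = df.drop(columns=['Customer ID'])

# keep only rows with a sensible loan amount (filtering)
df = df[df['Current Loan Amount'] < 99999999]

# remove rows where Annual Income is NaN ...
df = df.dropna(subset=['Annual Income'])
# ... or, for the extra task, impute the column average instead:
# df['Annual Income'] = df['Annual Income'].fillna(df['Annual Income'].mean())

# average Current Loan Amount
print(df['Current Loan Amount'].mean())

# Actual Annual Income: placeholder formula using an assumed Monthly Debt column
df['Actual Annual Income'] = df['Annual Income'] - 12 * df['Monthly Debt']

# Loan ID of the loan with the smallest Actual Annual Income
print(df.loc[df['Actual Annual Income'].idxmin(), 'Loan ID'])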
2. The following questions concern the purchases dataset (check Moodle).
A sketch of the main steps follows this list.
- What was the total price sum of the Purchase Order Number
  018H2015? (14 rows in total)
- What are the name and description of the purchased item with the
  Purchase Order Number 3176273?
- Replace the $ sign in the Unit Price column, and convert the
  values to float.
- How many purchases in the data were IT Goods and had a
  total price of more than 50000 dollars?
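A minimal sketch of these steps. The file name and the column names
('Total Price', 'Item Name', 'Item Description', 'Acquisition Type') are
assumptions based on the questions above; if Total Price also contains $
signs, clean it the same way as Unit Price first:

# example sketch for exercise 2
import pandas as pd

df = pd.read_csv('purchases.csv')  # placeholder file name

# total price sum of one purchase order (should match 14 rows)
po = df[df['Purchase Order Number'] == '018H2015']
print(len(po), po['Total Price'].sum())

# name and description of the item on purchase order 3176273
# (the order number may be stored as text or as a number, check the dtype)
item = df[df['Purchase Order Number'] == '3176273']
print(item[['Item Name', 'Item Description']])

# strip the $ sign from Unit Price and convert the values to float
df['Unit Price'] = df['Unit Price'].str.replace('$', '', regex=False).astype(float)

# IT Goods purchases with a total price over 50000 dollars
mask = (df['Acquisition Type'] == 'IT Goods') & (df['Total Price'] > 50000)
print(mask.sum())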
3. The following questions concern the salaries dataset (check Moodle).
- What are the most common values in the different fields (Job Title,
  Company, Location)? Based on the distribution, is the data
  balanced or not? (See the sketch after the factorize example below.)
  o Extra task: there seem to be some Job Titles that are almost
    the same, like "Machine Learning Data Associate" and
    "Machine Learning Associate"; combine these into
    something common to reduce the number of options.
- Are there any outliers in the data that might skew the averages
  (certain salaries)? Manage the outliers as you see fit
  (either remove them or leave them, based on your analysis).
- If we want to correlate on categories (ordinal or binary), we
  need to use factorize(). Factorize the Role column, and add the
  new column to the DataFrame.
  o Note: using factorize() for nominal categories (Job Title,
    Location, Company Name) doesn't work well, because the
    numbers do not have any numeric magnitude. In other
    words, these categories don't measure anything, they just
    group data, so numerical comparison / correlation doesn't
    mean anything statistically.
# example:
# factorize the Role column into numbers
label1, unique1 = pd.factorize(df['Role'], sort=False)
df['ManagerRole'] = label1
# numeric_only is needed in newer pandas versions when text columns are present
df.corr(method="spearman", numeric_only=True)
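Continuing from the example above, a sketch of the distribution, title-merging
and outlier steps. The 'Salary' column name and the 99th-percentile limit are
assumptions; pick whatever threshold your own analysis supports:

# example sketch for the distribution and outlier questions
print(df['Job Title'].value_counts())
print(df['Company Name'].value_counts())
print(df['Location'].value_counts())

# extra task: merge near-duplicate job titles into one common value
df['Job Title'] = df['Job Title'].replace(
    {'Machine Learning Data Associate': 'Machine Learning Associate'})

# look for salary outliers, e.g. everything above the 99th percentile
limit = df['Salary'].quantile(0.99)
print(df[df['Salary'] > limit])
# one option is to drop them:
# df = df[df['Salary'] <= limit]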
Extra tasks for this dataset:
  o After all salaries have been converted to the correct format by
    using the helper function (check Moodle), use quantiles
    and split the data into four different parts (a sketch follows
    this list):
    - Top 25% => above quantile(0.75)
    - Above average, values between the top 25-50%
      => between quantile(0.5) and quantile(0.75)
    - Below average, values between the top 50-75%
      => between quantile(0.25) and quantile(0.5)
    - Bottom 25% => below quantile(0.25)
  o Did you get any ideas how this data could be improved?
    Do we need some particular new data or some other
    operations on the data? Should we filter something out
    based on some other column?
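A minimal sketch of the quantile split, assuming the salaries are already
numeric in a column called 'Salary':

# example sketch for the quantile split
q1 = df['Salary'].quantile(0.25)
q2 = df['Salary'].quantile(0.5)
q3 = df['Salary'].quantile(0.75)

top = df[df['Salary'] >= q3]                                     # top 25%
above_average = df[(df['Salary'] >= q2) & (df['Salary'] < q3)]   # top 25-50%
below_average = df[(df['Salary'] >= q1) & (df['Salary'] < q2)]   # top 50-75%
bottom = df[df['Salary'] < q1]                                   # bottom 25%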
4. Data merge is a useful tool when you have multiple files of data that
have exactly the same structure (see pd.concat in the sketch below).
The built-in functions may not be enough in all cases, however: use
the data of exercise 1 (loans.csv), and create a new column called
"Income Group" that holds a text value.
Finally, get the number of rows grouped by each of the new Income
Group field's values and print them out.
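A minimal sketch. The part file names are placeholders, and the Income
Group thresholds and labels below are only an illustration; choose your own:

# example sketch for exercise 4
import pandas as pd

# merging files with identical structure (placeholder file names):
# df = pd.concat([pd.read_csv('loans_part1.csv'),
#                 pd.read_csv('loans_part2.csv')], ignore_index=True)

df = pd.read_csv('loans.csv')

def income_group(income):
    # placeholder thresholds; assumes Annual Income was cleaned as in exercise 1
    if income < 30000:
        return 'low'
    elif income < 70000:
        return 'medium'
    return 'high'

df['Income Group'] = df['Annual Income'].apply(income_group)

# number of rows per Income Group value
print(df.groupby('Income Group').size())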
5. Normalization allows us to convert the values of numeric columns to be
between 0 and 1. This is helpful when two different numbers seem to
follow the same trend but have completely different value ranges. For
example, gold and silver prices tend to follow similar patterns, but their
market worth is quite different. By using normalization, we can compare
these trends more easily.
Get historical prices of both gold and silver, and compare them without
and with normalization. You can plot the prices by using the df.plot()
function. Check the dataset list in Moodle for some alternatives for
gold and silver prices.
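A minimal sketch of min-max normalization. The file names and the 'Price'
column are placeholders for whichever datasets you pick from the Moodle list:

# example sketch for exercise 5
import pandas as pd
import matplotlib.pyplot as plt

gold = pd.read_csv('gold.csv', index_col=0, parse_dates=True)['Price']
silver = pd.read_csv('silver.csv', index_col=0, parse_dates=True)['Price']
prices = pd.DataFrame({'gold': gold, 'silver': silver})

# without normalization the larger gold prices dominate the plot
prices.plot(title='raw prices')

# min-max normalization scales every column to the range 0..1
normalized = (prices - prices.min()) / (prices.max() - prices.min())
normalized.plot(title='normalized prices')

plt.show()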
6. Create an account on kaggle.com, and find any dataset that interests
you.