
Exercise set 2 – pandas module

Put all your exercises (Jupyter Notebooks or Python files) in your course Git project.
Use either code comments or Jupyter Notebook markdown (text) to document which
exercise you are doing and what a certain code section does!
You can return all of these exercises in a single Jupyter Notebook, if you wish.

The datasets for these exercises have been collected from kaggle.com
(a service providing different datasets for practice)

1. Import pandas and read the csv file found in Moodle (loans.csv). Use
Python with pandas to answer the questions.
 Remove the Customer ID column from the data

 Print the head of the data

 Remove rows from the data that have too large a loan
(the Current Loan Amount should be less than 99999999)
 Tip: use filtering!

 Remove rows that have the Annual Income as NaN (not a number)
 Extra task: use imputation to replace the NaN values with the
average income instead
 Get the average Current Loan Amount

 Get the highest and lowest Annual Income in the dataset

 Get the Home Ownership value of the loan with
Loan ID = bbf87a87-22cd-4d10-bd9b-7a9cc1b6e59d

 Create a new field in your dataset called Actual Annual Income.
The Actual Annual Income follows this formula:

Annual Income – 12 * Monthly Debt

 Get the Actual Annual Income of the loan with the ID =
76fa89b9-e6a8-49af-afa1-8151315aba8e

 Get the Loan ID of the loan with the smallest Actual Annual
Income

 How many loans are "Long term"?

 How many borrowers have more than 1 bankruptcy?

 How many Short Term loans are for Home Improvements?

 How many unique loan purposes are there?

 What are the 3 most common loan purposes?

 Is there a correlation between Annual Income and Number of Open
Accounts, or between Number of Credit Problems and Bankruptcies?
 Also explain your findings in your Jupyter Notebook.
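For reference, a minimal sketch of these steps, assuming the column names in
loans.csv match the ones mentioned above (they are taken from the exercise
text, not verified against the file):

# example: loans.csv cleanup and a few of the questions
import pandas as pd

df = pd.read_csv("loans.csv")
df = df.drop(columns=["Customer ID"])     # remove the Customer ID column
print(df.head())

# keep only reasonable loan amounts and rows with a known income
df = df[df["Current Loan Amount"] < 99999999]
df = df.dropna(subset=["Annual Income"])

# new field: Actual Annual Income = Annual Income - 12 * Monthly Debt
df["Actual Annual Income"] = df["Annual Income"] - 12 * df["Monthly Debt"]

print(df["Current Loan Amount"].mean())                      # average loan amount
print(df["Annual Income"].max(), df["Annual Income"].min())  # highest and lowest income

# correlation between two numeric columns
print(df["Annual Income"].corr(df["Number of Open Accounts"]))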
2. Download the purchases.csv from Moodle, and make the following
observations:

 What was the total price sum of the Purchase Order Number
018H2015? (14 rows in total)

 What is the name and description of the purchased item with the
Purchase Order Number 3176273?

 How many occasions (rows) of purchase data happened during
the year 2013?

 What are the 5 most common Departments in the data?


 Extra task: What are the 3 Departments that use the most money
in the data?

 Sort the data by Department Name

 Replace the $ sign in the Unit Price column, and convert the
values to float

 Smaller extra tasks:

 How many purchases in the data were IT Goods and had a total
price of more than 50000 dollars?

 How many of the purchases have anything to do with IT?
(IT Goods, IT Services, IT Telecommunications)
 Other extra tasks:
 Create a new DataFrame, where you have filtered out
purchases that have a Total Price of 0 or less

 For this DataFrame, use the groupby() function twice to group
the purchases data by Acquisition Type, calculating the result
first with sum() and then with mean()

 Which two acquisition types have the largest sums and
means after grouping the data?
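For reference, a minimal sketch of the price cleanup and grouping steps,
assuming the column names ("Unit Price", "Total Price", "Acquisition Type")
match the ones used above:

# example: purchases.csv price cleanup and grouping
import pandas as pd

df = pd.read_csv("purchases.csv")

# strip the $ sign from Unit Price and convert to float
# ("Total Price" may need the same treatment if it also contains $ signs)
df["Unit Price"] = (
    df["Unit Price"].astype(str).str.replace("$", "", regex=False).astype(float)
)

# new DataFrame without purchases of 0 or less
positive = df[df["Total Price"] > 0]

# group by Acquisition Type, first sums and then means
print(positive.groupby("Acquisition Type")["Total Price"].sum().sort_values(ascending=False))
print(positive.groupby("Acquisition Type")["Total Price"].mean().sort_values(ascending=False))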
3. Download the data_salaries_india.csv from Moodle, and consider the
following questions about the data. Use any means in pandas (or even
NumPy) you wish to explain your answers.

 Before we can do anything with the salaries, we have to convert
them into something more usable
o Note: the salaries can be yearly, monthly or hourly salaries
 We also don't need the Indian rupee sign (₹)
o You can use the code example in Moodle to help you out
with this (Salary filtering, pandas exercise 3)

 What are the most common values in different fields (Job Titles,
Companies, Location)? Based on the distribution, is the data
balanced or not?
o Extra task: there seem to be some Job Titles that are almost
the same, like "Machine Learning Data Associate" and
"Machine Learning Associate", combine these into
something common to reduce the number of options

 Are there any outliers in the data that might affect the averages
negatively (certain salaries)? Manage the outliers as you best see
fit (either remove them or leave them, based on your analysis)
 If we want to correlate on categories (ordinal or binary), we
need to use factorize(). Factorize the Role column, and add the
new column to the DataFrame.
o Note: using factorize() for nominal categories (Job Title,
Location, Company Name) doesn't work well, because the
numbers do not have any numeric magnitude. In other
words, these categories don't measure anything, they just
group data, so numerical comparison / correlation doesn't
mean anything statistically.

# example:
# factorize Role level into numbers;
# label1 holds the integer codes, unique1 the original category labels
label1, unique1 = pd.factorize(df['Role'], sort=False)
df['ManagerRole'] = label1

 Finally, check out the correlations. Does anything correlate with
anything? Can we make any assumptions?
o Tip: When correlating against binary variables, sometimes
the Spearman correlation might be more sensitive, in
pandas:

df.corr(method="spearman")

(in newer pandas versions you may need to pass
numeric_only=True so that text columns are skipped)
 Extra tasks for this dataset:
o After all salaries have been converted to the correct format by
using the helper function (check Moodle), use quantiles
and split the data into four different parts:
 Top 25%
=> quantile(0.75)
 Above average, values between top 25-50%
=> quantile(0.5) - quantile(0.75)
 Below average, values between top 50-75%
=> quantile(0.25) – quantile(0.5)
 Bottom 25%
=> quantile(0.25)

 What are the salary ranges for each quantile?


 See examples in pandas-materials on how to use
quantiles

o Did you get any ideas about how this data could be improved?
Do we need some particular new data or some other
operations on the data? Should we filter something out
based on some other column?

Provide arguments for your answers in code comments.

 Note: There are many good possible answers here!

 Tip: How about replacing the "Salaries Reported"
column with actual rows based on that number? Try
doing this with the data!

 Remember: This data only represents data
engineering salaries based on selected Indian cities.
The world is a vast place!
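For reference, a minimal sketch of the quantile split, assuming the cleaned,
numeric salary values are stored in a column named "Salary" (the column name
is an assumption):

# example: splitting cleaned salaries into four quantile groups
import pandas as pd

df = pd.read_csv("data_salaries_india.csv")
# ... salary cleanup with the Moodle helper function goes here ...

q25 = df["Salary"].quantile(0.25)
q50 = df["Salary"].quantile(0.5)
q75 = df["Salary"].quantile(0.75)

top_25 = df[df["Salary"] > q75]
above_avg = df[(df["Salary"] > q50) & (df["Salary"] <= q75)]
below_avg = df[(df["Salary"] > q25) & (df["Salary"] <= q50)]
bottom_25 = df[df["Salary"] <= q25]

# salary range inside each group
for name, part in [("top 25%", top_25), ("above average", above_avg),
                   ("below average", below_avg), ("bottom 25%", bottom_25)]:
    print(name, part["Salary"].min(), part["Salary"].max())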
Advanced tasks (varying challenges, some require Googling):

1. Data merge is a useful tool when you have multiple files of data that
have the exact same structure.

Download the two csv files from Moodle (videogames1.csv and
videogames2.csv), and combine them into one DataFrame. Lastly, save
the DataFrame into a new csv file => combined.csv.
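For reference, a minimal sketch with pd.concat() (the file names come from
the task, everything else is just one possible approach):

# example: combining two identically structured csv files
import pandas as pd

df1 = pd.read_csv("videogames1.csv")
df2 = pd.read_csv("videogames2.csv")

# stack the two DataFrames on top of each other and renumber the index
combined = pd.concat([df1, df2], ignore_index=True)
combined.to_csv("combined.csv", index=False)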

2. Functions and lambdas allow us to extend the operations we wish to
do to columns and rows in pandas.

For example, the built-in functions may not be enough in all cases. Use
the data of exercise 1 (loans.csv), and create a new column called
"Income Group" that holds a text value.

Create either a function or a lambda that determines the Income Group
value based on the Annual Income column. Use the following table to
create the values:

Number range       Income group
0-25000            $25k or less
25001-50000        $25k-$50k
50001-100000       $50k-$100k
100001-200000      $100k-$200k
200001+            $200k+

After creating the function/lambda, you can use it with pandas'
.apply() function.

Finally, get the number of rows grouped by each of the new Income
Group field values and print them out.
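For reference, a minimal sketch of one possible approach (the function name
is made up here; the bin edges follow the table above, and the NaN incomes
are assumed to have been removed already in exercise 1):

# example: mapping Annual Income to an Income Group label with apply()
import pandas as pd

def income_group(income):
    # map an Annual Income value to the text label from the table above
    if income <= 25000:
        return "$25k or less"
    elif income <= 50000:
        return "$25k-$50k"
    elif income <= 100000:
        return "$50k-$100k"
    elif income <= 200000:
        return "$100k-$200k"
    else:
        return "$200k+"

df = pd.read_csv("loans.csv")
df["Income Group"] = df["Annual Income"].apply(income_group)

# number of rows per Income Group
print(df.groupby("Income Group").size())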
3. Normalization allows us to convert the values of numeric columns to be
between 0 and 1. This is helpful when two different numbers seem to
follow the same trend, but have completely different value ranges. For
example, gold and silver prices tend to follow similar patterns, but their
market worth is quite different. By using normalization, we can compare
these trends more easily.

Get historical prices of both gold and silver, and compare them without
and with normalization. You can plot the prices by using the df.plot()
function. Check the dataset list in Moodle for some alternatives for
gold and silver prices.
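For reference, a minimal sketch of min-max normalization (the file and column
names below are placeholders for whichever gold and silver price data you end
up using):

# example: comparing gold and silver price trends with min-max normalization
import pandas as pd

gold = pd.read_csv("gold_prices.csv")      # placeholder file names
silver = pd.read_csv("silver_prices.csv")

def min_max_normalize(series):
    # scale the values into the 0-1 range
    return (series - series.min()) / (series.max() - series.min())

prices = pd.DataFrame({
    "Gold": gold["Price"],      # "Price" column name is a placeholder
    "Silver": silver["Price"],
})

prices.plot()                            # hard to compare: very different value ranges
prices.apply(min_max_normalize).plot()   # trends on a comparable 0-1 scale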
4. Create an account on kaggle.com, and find any dataset that interests
you.

There's a list of possibly interesting datasets listed in Moodle as well.

Try to find interesting features in the data, in particular:

 Clean up the data first (rows with too many NaN values, values that
are way too big or small, insignificant columns etc.)
 You can create new columns as well if it seems suitable! (either by
using functions or other means)
 Interesting correlations (the .corr() function) and other interesting
features in the data. Is something surprising in the data?
 Note: There are many ways to approach this exercise.
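For reference, a minimal sketch of the kind of cleanup meant here (the file
name, threshold and column name are placeholders):

# example: basic cleanup and correlations on a downloaded dataset
import pandas as pd

df = pd.read_csv("your_dataset.csv")          # placeholder file name

# drop rows that have fewer than 5 non-NaN values
df = df.dropna(thresh=5)

# drop an insignificant column if present (placeholder name)
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

# correlations between the numeric columns
print(df.corr(numeric_only=True))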
