Exercises 2
Put all your exercises (Jupyter Notebooks or Python files) in your course Git project.
Use either code comments or Jupyter Notebook markdown (text) cells to document which
exercise you are doing and what a certain code section does!
You can return all of these exercises in a single Jupyter Notebook if you wish.
The datasets for these exercises have been collected from kaggle.com
(a service providing different datasets for practice).
1. Import pandas and read the CSV file found in Moodle (loans.csv). Use
Python with pandas to answer the questions (a sketch of the main steps
follows this list).
- Remove the Customer ID column from the data.
- Remove rows from the data that have too large a loan
  (Current Loan Amount should be less than 99999999).
  Tip: use filtering!
- Remove rows that have the Annual Income as NaN (not a number).
  o Extra task: use imputation to use the average income as the
    value instead of NaN.
- Get the average Current Loan Amount.
- Create a new field in your dataset called Actual Annual Income.
- Get the Loan ID of the loan with the smallest Actual Annual Income.
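A minimal sketch of these steps, assuming the column names (Customer ID,
Current Loan Amount, Annual Income, Monthly Debt, Loan ID) match loans.csv.
The Actual Annual Income line is only a placeholder; the exact formula
comes from the course material:

# example sketch for exercise 1
import pandas as pd

df = pd.read_csv('loans.csv')

# remove the Customer ID column
df = df.drop(columns=['Customer ID'])

# keep only rows with a sensible loan amount (filtering)
df = df[df['Current Loan Amount'] < 99999999]

# remove rows where Annual Income is NaN ...
df = df.dropna(subset=['Annual Income'])
# ... or, for the extra task, impute the column average instead:
# df['Annual Income'] = df['Annual Income'].fillna(df['Annual Income'].mean())

# average Current Loan Amount
print(df['Current Loan Amount'].mean())

# Actual Annual Income: placeholder formula using an assumed Monthly Debt column
df['Actual Annual Income'] = df['Annual Income'] - 12 * df['Monthly Debt']

# Loan ID of the loan with the smallest Actual Annual Income
print(df.loc[df['Actual Annual Income'].idxmin(), 'Loan ID'])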
2. The following questions concern the purchases dataset (check Moodle).
A sketch of the main steps follows this list.
- What was the total price sum of the Purchase Order Number
  018H2015? (14 rows in total)
- What are the name and description of the purchased item with the
  Purchase Order Number 3176273?
- Replace the $ sign in the Unit Price column, and convert the
  values to float.
- How many purchases in the data were IT Goods and had a
  total price of more than 50000 dollars?
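A minimal sketch of these steps. The file name and the column names
('Total Price', 'Item Name', 'Item Description', 'Acquisition Type') are
assumptions based on the questions above; if Total Price also contains $
signs, clean it the same way as Unit Price first:

# example sketch for exercise 2
import pandas as pd

df = pd.read_csv('purchases.csv')  # placeholder file name

# total price sum of one purchase order (should match 14 rows)
po = df[df['Purchase Order Number'] == '018H2015']
print(len(po), po['Total Price'].sum())

# name and description of the item on purchase order 3176273
# (the order number may be stored as text or as a number, check the dtype)
item = df[df['Purchase Order Number'] == '3176273']
print(item[['Item Name', 'Item Description']])

# strip the $ sign from Unit Price and convert the values to float
df['Unit Price'] = df['Unit Price'].str.replace('$', '', regex=False).astype(float)

# IT Goods purchases with a total price over 50000 dollars
mask = (df['Acquisition Type'] == 'IT Goods') & (df['Total Price'] > 50000)
print(mask.sum())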
3. The following questions concern the salaries dataset (check Moodle).
- What are the most common values in the different fields (Job Title,
  Company, Location)? Based on the distribution, is the data
  balanced or not? (See the sketch after the factorize example below.)
  o Extra task: there seem to be some Job Titles that are almost
    the same, like "Machine Learning Data Associate" and
    "Machine Learning Associate"; combine these into
    something common to reduce the number of options.
- Are there any outliers in the data that might skew the averages
  (certain salaries)? Manage the outliers as you see fit
  (either remove them or leave them, based on your analysis).
- If we want to correlate on categories (ordinal or binary), we
  need to use factorize(). Factorize the Role column, and add the
  new column to the DataFrame.
  o Note: using factorize() for nominal categories (Job Title,
    Location, Company Name) doesn't work well, because the
    numbers do not have any numeric magnitude. In other
    words, these categories don't measure anything, they just
    group data, so numerical comparison / correlation doesn't
    mean anything statistically.
# example:
# factorize the Role column into numbers
label1, unique1 = pd.factorize(df['Role'], sort=False)
df['ManagerRole'] = label1
# numeric_only is needed in newer pandas versions when text columns are present
df.corr(method="spearman", numeric_only=True)
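Continuing from the example above, a sketch of the distribution, title-merging
and outlier steps. The 'Salary' column name and the 99th-percentile limit are
assumptions; pick whatever threshold your own analysis supports:

# example sketch for the distribution and outlier questions
print(df['Job Title'].value_counts())
print(df['Company Name'].value_counts())
print(df['Location'].value_counts())

# extra task: merge near-duplicate job titles into one common value
df['Job Title'] = df['Job Title'].replace(
    {'Machine Learning Data Associate': 'Machine Learning Associate'})

# look for salary outliers, e.g. everything above the 99th percentile
limit = df['Salary'].quantile(0.99)
print(df[df['Salary'] > limit])
# one option is to drop them:
# df = df[df['Salary'] <= limit]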
Extra tasks for this dataset:
  o After all salaries have been converted to the correct format by
    using the helper function (check Moodle), use quantiles
    and split the data into four different parts (a sketch follows
    this list):
    - Top 25% => above quantile(0.75)
    - Above average, values between the top 25-50%
      => between quantile(0.5) and quantile(0.75)
    - Below average, values between the top 50-75%
      => between quantile(0.25) and quantile(0.5)
    - Bottom 25% => below quantile(0.25)
  o Did you get any ideas how this data could be improved?
    Do we need some particular new data or some other
    operations on the data? Should we filter something out
    based on some other column?
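A minimal sketch of the quantile split, assuming the salaries are already
numeric in a column called 'Salary':

# example sketch for the quantile split
q1 = df['Salary'].quantile(0.25)
q2 = df['Salary'].quantile(0.5)
q3 = df['Salary'].quantile(0.75)

top = df[df['Salary'] >= q3]                                     # top 25%
above_average = df[(df['Salary'] >= q2) & (df['Salary'] < q3)]   # top 25-50%
below_average = df[(df['Salary'] >= q1) & (df['Salary'] < q2)]   # top 50-75%
bottom = df[df['Salary'] < q1]                                   # bottom 25%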
4. Data merge is a useful tool when you have multiple files of data that
have exactly the same structure (see pd.concat in the sketch below).
The built-in functions may not be enough in all cases, however: use
the data of exercise 1 (loans.csv), and create a new column called
"Income Group" that holds a text value.
Finally, get the number of rows grouped by each of the new Income
Group field's values and print them out.
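A minimal sketch. The part file names are placeholders, and the Income
Group thresholds and labels below are only an illustration; choose your own:

# example sketch for exercise 4
import pandas as pd

# merging files with identical structure (placeholder file names):
# df = pd.concat([pd.read_csv('loans_part1.csv'),
#                 pd.read_csv('loans_part2.csv')], ignore_index=True)

df = pd.read_csv('loans.csv')

def income_group(income):
    # placeholder thresholds; assumes Annual Income was cleaned as in exercise 1
    if income < 30000:
        return 'low'
    elif income < 70000:
        return 'medium'
    return 'high'

df['Income Group'] = df['Annual Income'].apply(income_group)

# number of rows per Income Group value
print(df.groupby('Income Group').size())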
5. Normalization allows us to convert the values of numeric columns to be
between 0 and 1. This is helpful when two different numbers seem to
follow the same trend but have completely different value ranges. For
example, gold and silver prices tend to follow similar patterns, but their
market worth is quite different. By using normalization, we can compare
these trends more easily.
Get historical prices of both gold and silver, and compare them without
and with normalization. You can plot the prices by using the df.plot()
function. Check the dataset list in Moodle for some alternatives for
gold and silver prices.
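A minimal sketch of min-max normalization. The file names and the 'Price'
column are placeholders for whichever datasets you pick from the Moodle list:

# example sketch for exercise 5
import pandas as pd
import matplotlib.pyplot as plt

gold = pd.read_csv('gold.csv', index_col=0, parse_dates=True)['Price']
silver = pd.read_csv('silver.csv', index_col=0, parse_dates=True)['Price']
prices = pd.DataFrame({'gold': gold, 'silver': silver})

# without normalization the larger gold prices dominate the plot
prices.plot(title='raw prices')

# min-max normalization scales every column to the range 0..1
normalized = (prices - prices.min()) / (prices.max() - prices.min())
normalized.plot(title='normalized prices')

plt.show()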
6. Create an account on kaggle.com, and find any dataset that interests
you.