Python 101 - Python Libraries for Data Analysis - Numpy and Pandas
import numpy as np
[21]: array([ 50, 60, 80, 100, 200, 300, 500, 600])
[5]: type(my_numpy_array)
[5]: numpy.ndarray
MINI CHALLENGE #1: - Write code that creates the following 2x4 NumPy array
[[3 7 9 3]
 [4 3 2 2]]
[3]: x = np.array([[3, 7, 9, 3],
                   [4, 3, 2, 2]])
x
[ ]:
[10]: # "randint" is used to generate random integers between upper and lower bounds
x = np.random.randint(1,50)
x
[10]: 22
[11]: # "randint" can be used to generate a certain number of random integers as follows
[11]: array([77, 80, 61, 59, 73, 97, 19, 22, 82, 78, 49, 97, 75, 69, 84])
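The two `randint` calls can be sketched together as follows. The bounds `1` and `100` and the count `15` in the second call are assumptions chosen to resemble the output shown; the notebook does not show the call that produced it. The seed is only there to make the sketch repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is repeatable

# One integer with 1 <= value < 50 (the upper bound is exclusive)
single = np.random.randint(1, 50)

# Fifteen integers with 1 <= value < 100, returned as a 1-D array (assumed bounds)
many = np.random.randint(1, 100, 15)

# A 2x4 array of integers with 0 <= value < 10
grid = np.random.randint(0, 10, size=(2, 4))

print(single, many.shape, grid.shape)
```

Because the upper bound is exclusive, a call such as `np.random.randint(1, 50)` can never return 50.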
[12]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
MINI CHALLENGE #2: - Write code that takes in a positive integer “x” from the user and
creates a 1x10 array with random numbers ranging from 0 to “x”
[22]: # Ask the user to enter a positive integer
x = int(input("please enter a positive integer value"))
# Validate the input
if x <= 0:
    print("please enter a positive integer")
else:
    # Create a 1x10 array; randint's upper bound is exclusive, so values lie in [0, x)
    array = np.random.randint(0, x, size=(1, 10))
    print("generated table:")
    print(array)
please enter a positive integer value6
generated table:
[[5 3 2 4 4 1 4 4 3 2]]
[ ]:
[21]: array([1., 2., 3., 4., 5., 6., 7., 8., 9.])
[22]: z = np.exp(y)
z
MINI CHALLENGE #3: - Given the X and Y values below, obtain the distance between them
X = [5, 7, 20]
Y = [9, 15, 4]
[23]: x = np.array([5, 7, 20])
y = np.array([9, 15, 4])
d = np.sqrt(x**2 + y**2)
d
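The expression `np.sqrt(x**2 + y**2)` works element by element, so it returns one value per (x, y) pair: the distance of each pair from the origin. If X and Y are instead read as two points in 3-D space, the distance between them is a single number. Which reading the challenge intends is not stated, so the sketch below shows both; the variable names `pairwise` and `point_distance` are illustrative only.

```python
import numpy as np

X = np.array([5, 7, 20])
Y = np.array([9, 15, 4])

# Element-wise reading (the notebook's approach): one value per (X[i], Y[i]) pair
pairwise = np.sqrt(X**2 + Y**2)

# Alternative reading: X and Y are two points in 3-D space
point_distance = np.sqrt(np.sum((X - Y)**2))   # same as np.linalg.norm(X - Y)

print(pairwise)
print(point_distance)
```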
[6]: 50
[7]: # Slice from index 0 up to, but NOT including, index 3
my_numpy_array[0:3]
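A short sketch of slicing on the same array may help. The stop index is always excluded, negative indices count from the end, and an optional third value sets the step. The names `first_three`, `last_two` and `every_other` are illustrative.

```python
import numpy as np

my_numpy_array = np.array([50, 60, 80, 100, 200, 300, 500, 600])

first_three = my_numpy_array[0:3]   # indices 0, 1, 2 -> 50, 60, 80
last_two = my_numpy_array[-2:]      # negative indices count from the end -> 500, 600
every_other = my_numpy_array[::2]   # step of 2 -> 50, 80, 200, 500

print(first_three, last_two, every_other)
```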
[11]: # Get one element
matrix[0][0]
[11]: 8
MINI CHALLENGE #4: - In the following matrix, replace the last row with 0
X = [2 30 20 -2 -4]
[3 4 40 -3 -2]
[-3 4 -6 90 10]
[25 45 34 22 12]
[13 24 22 32 37]
[30]: X[4] = 0
X
[ ]:
[ ]:
[46]: new_matrix = matrix[ matrix > 7 ]
new_matrix
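The expression `matrix[ matrix > 7 ]` is boolean indexing: the comparison builds a True/False array of the same shape, and indexing with it keeps only the elements marked True. The 3x3 values below are hypothetical, since the notebook's own `matrix` is not shown in full.

```python
import numpy as np

# Hypothetical 3x3 matrix; the notebook's own `matrix` is not shown in full
matrix = np.array([[8, 2, 11],
                   [4, 9, 7],
                   [15, 1, 3]])

mask = matrix > 7            # boolean array with the same shape as matrix
new_matrix = matrix[mask]    # 1-D array holding 8, 11, 9, 15 (row-major order)

print(mask)
print(new_matrix)
```

Note that the result is flattened to one dimension, because the selected elements need not form a rectangle.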
MINI CHALLENGE #5: - In the following matrix, replace negative elements by 0 and replace odd
elements with -2
X = [2 30 20 -2 -4]
[3 4 40 -3 -2]
[-3 4 -6 90 10]
[25 45 34 22 12]
[13 24 22 32 37]
[4]: X = np.array([[2, 30, 20, -2, -4],
[3, 4, 40, -3, -2],
[-3, 4, -6, 90, 10],
[25, 45, 34, 22, 12],
[13, 24, 22, 32, 37]])
X
[23]: X[ X < 0 ]= 0
X[ X % 2 == 1] = -2
X
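As a possible alternative to modifying `X` in place, the same two rules can be written as a single `np.where` expression that returns a new array. This is only a sketch; the name `result` is illustrative and the in-place version above is what the notebook uses.

```python
import numpy as np

X = np.array([[2, 30, 20, -2, -4],
              [3, 4, 40, -3, -2],
              [-3, 4, -6, 90, 10],
              [25, 45, 34, 22, 12],
              [13, 24, 22, 32, 37]])

# Negative -> 0 first; otherwise odd -> -2; otherwise keep the value
result = np.where(X < 0, 0, np.where(X % 2 == 1, -2, X))

print(result)
```

Here `X` itself is left unchanged, which can be convenient when the original values are still needed.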
6 TASK #6: UNDERSTAND PANDAS FUNDAMENTALS
[35]: # Pandas is a data manipulation and analysis tool that is built on NumPy.
# Pandas uses a data structure known as a DataFrame (think of it as Microsoft Excel in Python).
[25]: Bank Client ID Bank Client Name Net Worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
2 333 Mitch 10000 9
3 444 Ryan 2000 5
[26]: pandas.core.frame.DataFrame
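The cell that builds `bank_client_df` is not shown in this export. One common way to create such a DataFrame is from a dictionary that maps column names to lists; the sketch below assumes that approach and copies the values from the printed table.

```python
import pandas as pd

# Values copied from the printed table; building it from a dict is an assumption
bank_client_df = pd.DataFrame({
    'Bank Client ID': [111, 222, 333, 444],
    'Bank Client Name': ['Chanel', 'Steve', 'Mitch', 'Ryan'],
    'Net Worth [$]': [3500, 29000, 10000, 2000],
    'Years with bank': [3, 4, 9, 5],
})

print(type(bank_client_df))
print(bank_client_df.shape)   # (rows, columns)
```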
[28]: # You can view just the first couple of rows using .head()
bank_client_df.head(2)
[28]: Bank Client ID Bank Client Name Net Worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
[29]: # You can view just the last couple of rows using .tail()
bank_client_df.tail(2)
[29]: Bank Client ID Bank Client Name Net Worth [$] Years with bank
2 333 Mitch 10000 9
3 444 Ryan 2000 5
MINI CHALLENGE #6: - A portfolio contains a collection of securities such as stocks, bonds and
ETFs. Define a dataframe named ‘portfolio_df’ that holds 3 different stock ticker symbols, number
of shares, and price per share (feel free to choose any stocks) - Calculate the total value of the
portfolio including all stocks
[44]: portfolio_df = pd.DataFrame({'stock ticker symbol':['AAPL', 'AMZN', 'T'],
'price per share [$]':[3500, 200, 40],
'Number of stocks': [3, 4, 9]})
portfolio_df
[44]: stock ticker symbol price per share [$] Number of stocks
0 AAPL 3500 3
1 AMZN 200 4
2 T 40 9
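[46]: # Dollar value of each holding = price per share * number of shares
stocks_dollar_value = portfolio_df['price per share [$]'] * portfolio_df['Number of stocks']
stocks_dollar_value.sum()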
[46]: 11660
[ ]:
house_price_df[0]
[47]: City \
0 Vancouver, BC
1 Toronto, Ont
2 Ottawa, Ont
3 Calgary, Alb
4 Montreal, Que
5 Halifax, NS
6 Regina, Sask
7 Fredericton, NB
8 (adsbygoogle = window.adsbygoogle || []).push(…
6 $254,000
7 $198,000
8 (adsbygoogle = window.adsbygoogle || []).push(…
12 Month Change
0 + 2.63 %
1 +10.2 %
2 + 15.4 %
3 – 1.5 %
4 + 9.3 %
5 + 3.6 %
6 – 3.9 %
7 – 4.3 %
8 (adsbygoogle = window.adsbygoogle || []).push(…
[48]: house_price_df[1]
[48]: Province \
0 British Columbia
1 Ontario
2 Alberta
3 Quebec
4 Manitoba
5 Saskatchewan
6 Nova Scotia
7 Prince Edward Island
8 Newfoundland / Labrador
9 New Brunswick
10 Canadian Average
11 (adsbygoogle = window.adsbygoogle || []).push(…
12 Month Change
0 + 7.6 %
1 – 3.2 %
2 – 7.5 %
3 + 7.6 %
4 – 1.4 %
5 – 3.8 %
6 + 3.5 %
7 + 3.0 %
8 – 1.6 %
9 – 2.2 %
10 – 1.3 %
11 (adsbygoogle = window.adsbygoogle || []).push(…
[ ]:
[ ]:
MINI CHALLENGE #7: - Write code that uses Pandas to read tabular US retirement data -
You can use data from here: https://www.ssa.gov/oact/progdata/nra.html
[ ]: retirement_df = pd.read_html('https://www.ssa.gov/oact/progdata/nra.html')
retirement_df[0]
[58]: Bank Client ID Bank Client Name Net Worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
2 333 Mitch 10000 9
3 444 Ryan 2000 5
[59]: Bank Client ID Bank Client Name Net Worth [$] Years with bank
2 333 Mitch 10000 9
3 444 Ryan 2000 5
bank_client_df
[60]: Bank Client Name Net Worth [$] Years with bank
0 Chanel 3500 3
1 Steve 29000 4
2 Mitch 10000 9
3 Ryan 2000 5
[62]: Bank Client Name Net Worth [$] Years with bank
1 Steve 29000 4
2 Mitch 10000 9
[4]: Bank client ID Bank Client Name Net worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
2 333 Mitch 10000 9
3 444 Ryan 2000 5
[2]: # Define a function that increases every client's net worth (stocks) by a fixed 20% (for simplicity's sake)
def networth_update(balance):
    return balance * 1.2
[5]: # You can apply a function to the DataFrame
bank_client_df['Net worth [$]'].apply(networth_update)
[5]: 0 4200.0
1 34800.0
2 12000.0
3 2400.0
Name: Net worth [$], dtype: float64
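For reference, `.apply()` calls the function once per element of the column. For simple arithmetic like this, the same result can usually be obtained with a single vectorized expression. The sketch below compares the two on the net worth values from the table; the names `applied` and `vectorized` are illustrative.

```python
import pandas as pd

# Net worth values taken from the table in the notebook
net_worth = pd.Series([3500, 29000, 10000, 2000], name='Net worth [$]')

def networth_update(balance):
    """Increase a single balance by 20%."""
    return balance * 1.2

# apply() calls the function once per element
applied = net_worth.apply(networth_update)

# The same arithmetic written as one vectorized expression
vectorized = net_worth * 1.2

print(applied)
print(vectorized)
```

`apply` remains the more general tool, since the function can contain arbitrary Python logic.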
[ ]:
MINI CHALLENGE #9: - Define a function that triples the stock prices and adds $200 - Apply
the function to the DataFrame - Calculate the updated total net worth of all clients combined
[8]: def networth_update(balance):
    return balance * 3 + 200
[11]: 0 10700
1 87200
2 30200
3 6200
Name: Net worth [$], dtype: int64
[ ]:
[12]: Bank client ID Bank Client Name Net worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
2 333 Mitch 10000 9
3 444 Ryan 2000 5
[14]: # You can sort the values in the dataframe according to number of years with bank
[14]: Bank client ID Bank Client Name Net worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
3 444 Ryan 2000 5
2 333 Mitch 10000 9
[15]: # Note that nothing changed in memory! You have to make sure that inplace is set to True
bank_client_df
[15]: Bank client ID Bank Client Name Net worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
2 333 Mitch 10000 9
3 444 Ryan 2000 5
[ ]: # Set inplace = True to ensure that change has taken place in memory
bank_client_df.sort_values(by = 'Years with bank', inplace = True)
[16]: Bank client ID Bank Client Name Net worth [$] Years with bank
0 111 Chanel 3500 3
1 222 Steve 29000 4
2 333 Mitch 10000 9
3 444 Ryan 2000 5
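The difference between the two forms of `sort_values` can be sketched as follows. Without `inplace`, the method returns a sorted copy and leaves the original DataFrame alone; with `inplace=True`, the original itself is reordered. The helper names `sorted_copy`, `index_before` and `index_after` are illustrative.

```python
import pandas as pd

bank_client_df = pd.DataFrame({
    'Bank client ID': [111, 222, 333, 444],
    'Bank Client Name': ['Chanel', 'Steve', 'Mitch', 'Ryan'],
    'Net worth [$]': [3500, 29000, 10000, 2000],
    'Years with bank': [3, 4, 9, 5],
})

# Without inplace, sort_values returns a sorted copy and leaves the original alone
sorted_copy = bank_client_df.sort_values(by='Years with bank')
index_before = list(bank_client_df.index)   # original row order: 0, 1, 2, 3

# With inplace=True the original DataFrame itself is reordered
bank_client_df.sort_values(by='Years with bank', inplace=True)
index_after = list(bank_client_df.index)    # sorted row order: 0, 1, 3, 2

print(list(sorted_copy.index), index_before, index_after)
```

An equivalent pattern is to reassign the result: `bank_client_df = bank_client_df.sort_values(by='Years with bank')`.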
[24]: A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
[25]: df1
[25]: A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
[28]: df2
[28]: A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
[30]: df3
[30]: A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
[31]: A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
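The cell that produced the combined 12-row table is not shown in this export. It is consistent with stacking `df1`, `df2` and `df3` vertically using `pd.concat`, which the sketch below assumes. The `make_frame` helper is hypothetical and only exists to rebuild the three frames compactly.

```python
import pandas as pd

def make_frame(start):
    """Hypothetical helper: build a 4-row frame with labels like A0..D3, indexed from `start`."""
    rows = range(start, start + 4)
    return pd.DataFrame({col: [f'{col}{i}' for i in rows] for col in 'ABCD'},
                        index=list(rows))

df1, df2, df3 = make_frame(0), make_frame(4), make_frame(8)

# Stack the frames vertically; the original index labels 0..11 are preserved
combined = pd.concat([df1, df2, df3])

print(combined.shape)
print(combined.loc[10, 'C'])
```

If the frames shared index labels, passing `ignore_index=True` to `pd.concat` would renumber the rows from 0.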
Bank_df_1
[42]: raw_data = { 'Bank Client ID': ['6', '7', '8', '9', '10'],
'First Name':['Babacar', 'Ibrahima', 'Assane', 'Youssoufa', 'alphonse'],
Bank_df_2 = pd.DataFrame(raw_data, columns = ['Bank Client ID', 'First Name', 'Last Name'])
Bank_df_2
bank_df_salary
[58]: bank_df_all['Bank Client ID'] = bank_df_all['Bank Client ID'].astype(int)
bank_df_salary['Bank Client ID'] = bank_df_salary['Bank Client ID'].astype(int)
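The notebook does not show why the ID columns are cast to `int`. One plausible reason, assumed here, is that the two DataFrames are later merged on 'Bank Client ID', and a merge key should have the same dtype on both sides (a string key against an integer key typically fails or matches nothing). The data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: IDs stored as strings in one frame and as integers in the other
bank_df_all = pd.DataFrame({'Bank Client ID': ['1', '2', '3'],
                            'First Name': ['A', 'B', 'C']})
bank_df_salary = pd.DataFrame({'Bank Client ID': [1, 2, 3],
                               'Annual Salary [$/year]': [25000, 35000, 45000]})

# Give the key column the same dtype on both sides before merging
bank_df_all['Bank Client ID'] = bank_df_all['Bank Client ID'].astype(int)
bank_df_salary['Bank Client ID'] = bank_df_salary['Bank Client ID'].astype(int)

# Join the two frames on the shared key column
merged = pd.merge(bank_df_all, bank_df_salary, on='Bank Client ID')

print(merged)
```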
new_client_df
[84]: Bank Client ID First Name Last Name Annual Salary [$/year]
0 11 Cheikh Thiame NaN
[ ]:
[ ]:
[ ]:
13 EXCELLENT JOB!
MINI CHALLENGE #2 SOLUTION: - Write code that takes in a positive integer “x” from the
user and creates a 1x10 array with random numbers ranging from 0 to “x”
[ ]: x = int(input("Please enter a positive integer value: "))
# The lower bound is 0 to match the task; randint's upper bound is exclusive
x = np.random.randint(0, x, 10)
x
[ ]:
MINI CHALLENGE #3 SOLUTION: - Given the X and Y values below, obtain the distance
between them
X = [5, 7, 20]
Y = [9, 15, 4]
[ ]: X = np.array([5, 7, 20])
Y = np.array([9, 15, 4])
Z = np.sqrt(X**2 + Y**2)
Z
MINI CHALLENGE #4 SOLUTION: - In the following matrix, replace the last row with 0
X = [2 30 20 -2 -4]
[3 4 40 -3 -2]
[-3 4 -6 90 10]
[25 45 34 22 12]
[13 24 22 32 37]
[ ]: X = np.array([[2, 30, 20, -2, -4],
[3, 4, 40, -3, -2],
[-3, 4, -6, 90, 10],
[25, 45, 34, 22, 12],
[13, 24, 22, 32, 37]])
[ ]: X[4] = 0
X
MINI CHALLENGE #5 SOLUTION: - In the following matrix, replace negative elements by 0 and
replace odd elements with -2
X = [2 30 20 -2 -4]
[3 4 40 -3 -2]
[-3 4 -6 90 10]
[25 45 34 22 12]
[13 24 22 32 37]
[ ]: X = np.array([[2, 30, 20, -2, -4],
[3, 4, 40, -3, -2],
[-3, 4, -6, 90, 10],
[25, 45, 34, 22, 12],
[13, 24, 22, 32, 37]])
X[X<0] = 0
X[X%2==1] = -2
X
print(stocks_dollar_value)
print('Total portfolio value = {}'.format(stocks_dollar_value.sum()))
MINI CHALLENGE #7 SOLUTION: - Write code that uses Pandas to read tabular US retirement
data - You can use data from here: https://www.ssa.gov/oact/progdata/nra.html
MINI CHALLENGE #9 SOLUTION: - Define a function that triples the stock prices and adds
$200 - Apply the function to the DataFrame - Calculate the updated total net worth of all clients
combined
[ ]: def networth_update(balance):
    return balance * 3 + 200
[ ]: results.sum()
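The cell that defines `results` is missing from this export. A minimal end-to-end sketch, assuming `results` holds the updated net worth column produced by `.apply()`, could look like this; the name `total` is illustrative.

```python
import pandas as pd

bank_client_df = pd.DataFrame({
    'Bank client ID': [111, 222, 333, 444],
    'Bank Client Name': ['Chanel', 'Steve', 'Mitch', 'Ryan'],
    'Net worth [$]': [3500, 29000, 10000, 2000],
    'Years with bank': [3, 4, 9, 5],
})

def networth_update(balance):
    """Triple the balance and add $200."""
    return balance * 3 + 200

# `results` is assumed to hold the updated net worth column
results = bank_client_df['Net worth [$]'].apply(networth_update)
total = results.sum()

print(results.tolist())   # [10700, 87200, 30200, 6200]
print(total)              # 134300
```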
PROJECT SOLUTION:
[ ]: # Creating a dataframe from a dictionary
# Let's define a dataframe with a list of bank clients with IDs = 1, 2, 3, 4, 5
Bank_df_1
# Let's define another dataframe for a separate list of clients (IDs = 6, 7, 8, 9, 10)
raw_data = {
'Bank Client ID': ['6', '7', '8', '9', '10'],
'First Name': ['Bill', 'Dina', 'Sarah', 'Heather', 'Holly'],
'Last Name': ['Christian', 'Mo', 'Steve', 'Bob', 'Michelle']}
Bank_df_2 = pd.DataFrame(raw_data, columns = ['Bank Client ID', 'First Name', 'Last Name'])
Bank_df_2
bank_df_salary
bank_df_all
[ ]: new_client = {
'Bank Client ID': ['11'],
'First Name': ['Ry'],
'Last Name': ['Aly'],
'Annual Salary [$/year]' : [1000]}
new_client_df = pd.DataFrame(new_client, columns = ['Bank Client ID', 'First Name', 'Last Name', 'Annual Salary [$/year]'])
new_client_df
[ ]: