AD3301 Data Transformation: Chapter_4_Data_Transformation.ipynb (Colaboratory)
import pandas as pd
import numpy as np
Combining dataframes
dataFrame1 = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'Score': [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30], 'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})
# In the dataset above, the first column contains the student identifier and the second column contains the student's score.
pd.concat([dataFrame1, dataFrame2], ignore_index=True)

    StudentID  Score
0           1     89
1           3     39
2           5     50
3           7     97
4           9     22
5          11     66
6          13     31
7          15     51
8          17     71
9          19     91
10         21     56
11         23     32
12         25     52
13         27     73
14         29     92
15          2     98
16          4     93
17          6     44
18          8     77
19         10     69
20         12     56
21         14     31
22         16     53
23         18     78
24         20     93
25         22     56
26         24     77
27         26     33
28         28     56
29         30     27

The argument ignore_index creates a new index; in its absence, the original indices are kept (see the sketch below). Note that we combined the dataframes along axis=0, that is to say, we stacked them in the same direction. What if we want to combine them side by side? Then we have to specify axis=1. Check the output and see the difference.
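Before moving on to axis=1, a minimal sketch of what happens when ignore_index is omitted, using the two frames defined above: the original 0-14 labels are kept, so they repeat in the result. The keys argument additionally tags each source frame:

pd.concat([dataFrame1, dataFrame2])                              # index labels 0-14 appear twice
pd.concat([dataFrame1, dataFrame2], keys=['frame1', 'frame2'])   # hierarchical index naming each source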
pd.concat([dataFrame1, dataFrame2], axis=1)
    StudentID  Score  StudentID  Score
0           1     89          2     98
1           3     39          4     93
2           5     50          6     44
3           7     97          8     77
4           9     22         10     69
5          11     66         12     56
6          13     31         14     31
7          15     51         16     53
8          17     71         18     78
9          19     91         20     93
10         21     56         22     56
11         23     32         24     77
12         25     52         26     33
13         27     73         28     56
14         29     92         30     27
Merging
In the first example, you received two files for the same subject. Now, consider the use case where you are teaching two courses, so you will get two dataframes from each course: two for the Software Engineering course and two for the Introduction to Machine Learning course. The dataframes are created below:
df1SE = pd.DataFrame({'StudentID': [9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'ScoreSE': [22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
df2SE = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30], 'ScoreSE': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})
df1ML = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'ScoreML': [39, 49, 55, 77, 52, 86, 41, 77, 73, 51, 86, 82, 92, 23, 49]})
df2ML = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], 'ScoreML': [93, 44, 78, 97, 87, 89, 39, 43, 88, 78]})
As you can see in the dataset above, you have two dataframes for each subject. So the first task is to concatenate the two dataframes for each subject into one. Secondly, these students have taken the Introduction to Machine Learning course as well, so we need to merge those scores into the same dataframe. There are several ways to do this. Let us explore some options.
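All of the options below build on DataFrame.merge. As a minimal sketch (using the frames defined above), the indicator argument adds a _merge column showing where each row came from, which makes the differences between the join types easy to inspect:

dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
dfSE.merge(dfML, on='StudentID', how='outer', indicator=True)   # _merge: both / left_only / right_only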
# Option 1
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
# Side-by-side view (this display line is reconstructed from the output below)
pd.concat([dfML, dfSE], axis=1)
    StudentID  ScoreML  StudentID  ScoreSE
0         1.0     39.0          9       22
1         3.0     49.0         11       66
2         5.0     55.0         13       31
3         7.0     77.0         15       51
4         9.0     52.0         17       71
5        11.0     86.0         19       91
6        13.0     41.0         21       56
7        15.0     77.0         23       32
8        17.0     73.0         25       52
9        19.0     51.0         27       73
10       21.0     86.0         29       92
11       23.0     82.0          2       98
12       25.0     92.0          4       93
13       27.0     23.0          6       44
14       29.0     49.0          8       77
15        2.0     93.0         10       69
16        4.0     44.0         12       56
17        6.0     78.0         14       31
18        8.0     97.0         16       53
19       10.0     87.0         18       78
20       12.0     89.0         20       93
21       14.0     39.0         22       56
22       16.0     43.0         24       77
23       18.0     88.0         26       33
24       20.0     78.0         28       56
25        NaN      NaN         30       27

# Option 2
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='inner')
df
# Here, you will perform an inner join with each dataframe: if a StudentID exists in both dataframes, the row is included in the result.
    StudentID  ScoreSE  ScoreML
0           9       22       52
1          11       66       86
2          13       31       41
3          15       51       77
4          17       71       73
5          19       91       51
6          21       56       86
7          23       32       82
8          25       52       92
9          27       73       23
10         29       92       49
11          2       98       93
12          4       93       44
13          6       44       78
14          8       77       97
15         10       69       87
16         12       56       89
17         14       31       39
18         16       53       43
19         18       78       88
20         20       93       78

# Option 3
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='left')
df
    StudentID  ScoreSE  ScoreML
0           9       22     52.0
1          11       66     86.0
2          13       31     41.0
3          15       51     77.0
4          17       71     73.0
5          19       91     51.0
6          21       56     86.0
7          23       32     82.0
8          25       52     92.0
9          27       73     23.0
10         29       92     49.0
11          2       98     93.0
12          4       93     44.0
13          6       44     78.0
14          8       77     97.0
15         10       69     87.0
16         12       56     89.0
17         14       31     39.0
18         16       53     43.0
19         18       78     88.0
20         20       93     78.0
21         22       56      NaN
22         24       77      NaN
23         26       33      NaN
24         28       56      NaN
25         30       27      NaN

# Option 4
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='right')
df
    StudentID  ScoreSE  ScoreML
0           9     22.0       52
1          11     66.0       86
2          13     31.0       41
3          15     51.0       77
4          17     71.0       73
5          19     91.0       51
6          21     56.0       86
7          23     32.0       82
8          25     52.0       92
9          27     73.0       23
10         29     92.0       49
11          2     98.0       93
12          4     93.0       44
13          6     44.0       78
14          8     77.0       97
15         10     69.0       87
16         12     56.0       89
17         14     31.0       39
18         16     53.0       43
19         18     78.0       88
20         20     93.0       78
21          1      NaN       39
22          3      NaN       49
23          5      NaN       55
24          7      NaN       77

# Option 5
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='outer')
df
    StudentID  ScoreSE  ScoreML
0           9     22.0     52.0
1          11     66.0     86.0
2          13     31.0     41.0
3          15     51.0     77.0
4          17     71.0     73.0
5          19     91.0     51.0
6          21     56.0     86.0
7          23     32.0     82.0
8          25     52.0     92.0
9          27     73.0     23.0
10         29     92.0     49.0
11          2     98.0     93.0
12          4     93.0     44.0
13          6     44.0     78.0
14          8     77.0     97.0
15         10     69.0     87.0
16         12     56.0     89.0
17         14     31.0     39.0
18         16     53.0     43.0
19         18     78.0     88.0
20         20     93.0     78.0
21         22     56.0      NaN
22         24     77.0      NaN
23         26     33.0      NaN
24         28     56.0      NaN
25         30     27.0      NaN
26          1      NaN     39.0
27          3      NaN     49.0
28          5      NaN     55.0
29          7      NaN     77.0

df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-d
df.head(10)

   Account    Company           Order  SKU                Country        Year  Quantity
0  123456779  Kulas Inc         99985  s9-supercomputer   Aruba          1981      5148
1  123456784  GitHub            99986  s4-supercomputer   Brazil         2001      3262
2  123456782  Kulas Inc         99990  s10-supercomputer  Montserrat     1973      9119
3  123456783  My SQ Man         99999  s1-supercomputer   El Salvador    2015      3097
4  123456787  ABC Dogma         99996  s6-supercomputer   Poland         1970      3356
5  123456778  Super Sexy Dingo  99996  s9-supercomputer   Costa Rica     2004      2474
6  123456783  ABC Dogma         99981  s11-supercomputer  Spain          2006      4081
7  123456785  ABC Dogma         99998  s9-supercomputer   Belarus        2015      6576
8  123456778  Loolo INC         99997  s8-supercomputer   Mauritius      1999      2460
9  123456775  Kulas Inc         99997  s7-supercomputer   French Guiana  2004      1831

[The UnitPrice column was cut off at the page margin in the export.]

# Add a new column that is the total price, based on the quantity and the unit price
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)
df.describe()
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1
          Bergen  Oslo  Trondheim  Stavanger  Kristiansand
Rainfall       0     1          2          3             4
Humidity       5     6          7          8             9
Wind          10    11         12         13            14
stacked = dframe1.stack()
stacked
Rainfall Bergen 0
Oslo 1
Trondheim 2
Stavanger 3
Kristiansand 4
Humidity Bergen 5
Oslo 6
Trondheim 7
Stavanger 8
Kristiansand 9
Wind Bergen 10
Oslo 11
Trondheim 12
Stavanger 13
Kristiansand 14
dtype: int64
stacked.unstack()
          Bergen  Oslo  Trondheim  Stavanger  Kristiansand
Rainfall       0     1          2          3             4
Humidity       5     6          7          8             9
Wind          10    11         12         13            14
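Both stack and unstack take a level argument. As a minimal sketch, unstacking level 0 pivots the other way, putting the cities on the rows and the measurements on the columns:

stacked.unstack(0)   # index: the five cities; columns: Rainfall, Humidity, Wind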
Data deduplication
# (creation cell reconstructed from the table that followed it)
frame3 = pd.DataFrame({'column 1': ['Looping', 'Looping', 'Looping', 'Functions', 'Functions', 'Functions', 'Functions'], 'column 2': [10, 10, 22, 23, 23, 24, 24]})
frame3

    column 1  column 2
0    Looping        10
1    Looping        10
2    Looping        22
3  Functions        23
4  Functions        23
5  Functions        24
6  Functions        24

frame3.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
frame4 = frame3.drop_duplicates()
frame4
column 1 column 2
0 Looping 10
2 Looping 22
3 Functions 23
5 Functions 24
# Add a position column, then drop duplicates based on the first two columns only
# (cell reconstructed from the output below; the new column name is inferred)
frame3['column 3'] = range(7)
frame3.drop_duplicates(['column 1', 'column 2'])

    column 1  column 2  column 3
0    Looping        10         0
2    Looping        22         2
3  Functions        23         3
5  Functions        24         5
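By default drop_duplicates keeps the first occurrence. A minimal sketch of the keep parameter on the same columns:

frame3.drop_duplicates(['column 1', 'column 2'], keep='last')   # keep the last occurrence instead
frame3.drop_duplicates(['column 1', 'column 2'], keep=False)    # drop every row that has a duplicate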
Replacing values
import numpy as np
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.], 'column 2': range(9)})
replaceFrame.replace(to_replace=-786, value=np.nan)
column 1 column 2
0 200.0 0
1 3000.0 1
2 NaN 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 NaN 6
7 332.0 7
8 3332.0 8
replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.], 'column 2': range(9)})
replaceFrame.replace(to_replace=[-786, 0], value=[np.nan, 2])
column 1 column 2
0 200.0 2
1 3000.0 1
2 NaN 2
3 3000.0 3
4 234.0 4
5 444.0 5
6 NaN 6
7 332.0 7
8 3332.0 8
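The same pair of substitutions can also be expressed as a single dict, which scales better when many values are remapped; a minimal sketch:

replaceFrame.replace({-786: np.nan, 0: 2})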
# (creation cell reconstructed from the table that followed it)
dfx = pd.DataFrame(np.arange(15, 30).reshape(5, 3), index=['apple', 'banana', 'kiwi', 'grapes', 'mango'], columns=['store1', 'store2', 'store3'])
dfx

        store1  store2  store3
apple       15      16      17
banana      18      19      20
kiwi        21      22      23
grapes      24      25      26
mango       27      28      29
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.0   # use .loc to avoid chained-assignment pitfalls
dfx
dfx.isnull()
dfx.notnull()
dfx.isnull().sum()

store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64
dfx.isnull().sum().sum()
15
dfx.count()
store1 6
store2 6
store3 6
store4 2
store5 0
dtype: int64
dfx.store4[dfx.store4.notnull()]
apple 20.0
watermelon 18.0
Name: store4, dtype: float64
dfx.store4.dropna()
apple 20.0
watermelon 18.0
Name: store4, dtype: float64
dfx.dropna()
dfx.dropna(how='all')
dfx.dropna(how='all', axis=1)
dfx2 = dfx.copy()
dfx2.loc['oranges', 'store1'] = 0
dfx2.loc['oranges', 'store3'] = 0
dfx2
dfx2.dropna(how='any', axis=1)
            store1  store3
apple         15.0    17.0
banana        18.0    20.0
kiwi          21.0    23.0
grapes        24.0    26.0
mango         27.0    29.0
watermelon    15.0    17.0
oranges        0.0     0.0

# Keep only columns with at least 5 non-NaN values (store1, store2 and store3 here)
dfx.dropna(thresh=5, axis=1)
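dropna also accepts a subset argument to consider only particular columns when deciding which rows to drop; a minimal sketch on dfx:

dfx.dropna(subset=['store4'])   # keep only the rows where store4 is not NaN (apple and watermelon)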
# NumPy's mean propagates NaN, while the pandas Series mean skips it.
# (The cell defining ar1/ser1 was lost in the export; the values are inferred from the output.)
ar1 = np.array([100, 200, np.nan, 300])
ser1 = pd.Series(ar1)
ar1.mean(), ser1.mean()

(nan, 200.0)
ser2 = dfx.store4
ser2.sum()
38.0
ser2.mean()
19.0
ser2.cumsum()
apple 20.0
banana NaN
kiwi NaN
grapes NaN
mango NaN
watermelon 38.0
oranges NaN
Name: store4, dtype: float64
dfx.store4 + 1
apple 21.0
banana NaN
kiwi NaN
grapes NaN
mango NaN
watermelon 19.0
oranges NaN
Name: store4, dtype: float64
filledDf = dfx.fillna(0)
filledDf
dfx.mean()
store1 20.0
store2 21.0
store3 22.0
store4 19.0
store5 NaN
dtype: float64
filledDf.mean()
store1 17.142857
store2 18.000000
store3 18.857143
store4 5.428571
store5 0.000000
dtype: float64
dfx.store4.fillna(method='ffill')
apple 20.0
banana 20.0
kiwi 20.0
grapes 20.0
mango 20.0
watermelon 18.0
oranges 18.0
Name: store4, dtype: float64
dfx.store4.fillna(method='bfill')
apple 20.0
banana 18.0
kiwi 18.0
grapes 18.0
mango 18.0
watermelon 18.0
oranges NaN
Name: store4, dtype: float64
# Fill NaNs in store4 from another series, aligned by index
# (the cell defining to_fill was lost in the export; values are taken from the printed output)
to_fill = pd.Series([14, 23, 12], index=['apple', 'mango', 'oranges'])
to_fill

apple      14
mango      23
oranges    12
dtype: int64

dfx.store4.fillna(to_fill)
apple 20.0
banana NaN
kiwi NaN
grapes NaN
mango 23.0
watermelon 18.0
oranges 12.0
Name: store4, dtype: float64
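fillna also accepts a dict keyed by column name, so each column can get its own fill value; a minimal sketch (the fill values here are arbitrary):

dfx.fillna({'store4': 0, 'store5': -1})   # store4's NaNs become 0, store5's become -1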
dfx.fillna(dfx.mean())
# Linear interpolation on a simple series (the defining cell was lost in the export;
# the intermediate values were NaN, reconstructed from the interpolated output below)
ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()

0    100.0
1    148.0
2    196.0
3    244.0
4    292.0
dtype: float64

ts = pd.Series([10, np.nan, np.nan, 9], index=pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01', '2019-05-01']))
ts
2019-01-01 10.0
2019-02-01 NaN
2019-03-01 NaN
2019-05-01 9.0
dtype: float64
ts.interpolate()
2019-01-01 10.000000
2019-02-01 9.666667
2019-03-01 9.333333
2019-05-01 9.000000
dtype: float64
ts.interpolate(method='time')
2019-01-01 10.000000
2019-02-01 9.741667
2019-03-01 9.508333
2019-05-01 9.000000
dtype: float64
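A quick sanity check on the time-weighted value for 2019-02-01: 31 of the 120 days between the two known points have elapsed, so

10.0 + (9.0 - 10.0) * 31 / 120   # = 9.741667, matching the interpolated output above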
data = np.arange(15).reshape((3,5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1
          Bergen  Oslo  Trondheim  Stavanger  Kristiansand
Rainfall       0     1          2          3             4
Humidity       5     6          7          8             9
Wind          10    11         12         13            14
# Upper-case the index labels in place (cell reconstructed from the output below)
dframe1.index = dframe1.index.map(str.upper)
dframe1

          Bergen  Oslo  Trondheim  Stavanger  Kristiansand
RAINFALL       0     1          2          3             4
HUMIDITY       5     6          7          8             9
WIND          10    11         12         13            14
dframe1.rename(index=str.title, columns=str.upper)
          BERGEN  OSLO  TRONDHEIM  STAVANGER  KRISTIANSAND
Rainfall       0     1          2          3             4
Humidity       5     6          7          8             9
Wind          10    11         12         13            14
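Besides functions, rename accepts explicit dict mappings for individual labels; a minimal sketch (the new names are hypothetical):

dframe1.rename(index={'RAINFALL': 'PRECIPITATION'}, columns={'Bergen': 'BGO'})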
import pandas as pd
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)
category
[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]
pd.value_counts(category)
(118, 125] 5
(135, 160] 3
(125, 135] 3
(160, 200] 1
dtype: int64
category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)
category2
[[118, 126), [118, 126), [118, 126), [126, 136), [118, 126), ..., [126, 136), [161, 200), [136, 161), [136, 161), [126, 136)]
Length: 12
Categories (4, interval[int64]): [[118, 126) < [126, 136) < [136, 161) < [161, 200)]
bin_names = ['Short Height', 'Average height', 'Good Height', 'Taller']
pd.cut(height, bins, labels=bin_names)

[Short Height, Short Height, Short Height, Average height, Short Height, ..., Average height, Taller, Good Height, Good Height, Average height]
Length: 12
Categories (4, object): [Short Height < Average height < Good Height < Taller]
pd.cut(np.random.rand(40), 5, precision=2)
[(0.21, 0.41], (0.21, 0.41], (0.79, 0.98], (0.02, 0.21], (0.79, 0.98], ..., (0.41, 0.6], (0.02, 0.21], (0.6, 0.79], (0.02, 0.21], (0.6, 0.79]]
Length: 40
Categories (5, interval[float64]): [(0.02, 0.21] < (0.21, 0.41] < (0.41, 0.6] < (0.6, 0.79] < (0.79, 0.98]]
randomNumbers = np.random.rand(2000)
category3 = pd.qcut(randomNumbers, 4) # cut into quartiles
category3
pd.value_counts(category3)
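Instead of a number of quantiles, qcut also accepts explicit quantile edges; a minimal sketch splitting the same numbers into 10%, 40%, 40% and 10% groups:

pd.qcut(randomNumbers, [0, 0.1, 0.5, 0.9, 1.])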
df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-d
df.head(10)
df.describe()
[df.describe() output: summary statistics for the numeric columns. Only the max row survived the export: Account 1.234568e+08, Order 99999.0, Year 2019.0, Quantity 9999.0, and 700.0 (likely UnitPrice).]
# The cell producing this output was lost in the export; a plausible reconstruction,
# given that every value shown exceeds 3,000,000:
df['TotalPrice'][df['TotalPrice'] > 3000000]

2       3711433
7       3965328
13      4758900
15      5189372
17      3989325
         ...
9977    3475824
9984    5251134
9987    5670420
9991    5735513
9996    3018490
Name: TotalPrice, Length: 2094, dtype: int64
[Output of a cell lost in the export; the surviving rows:]

      Account    Company    Order  SKU               Country       Year  Quantity
818   123456781  Gen Power  99991  s1-supercomputer  Burkina Faso  1985      9693
7674  123456781  Loolo INC  99989  s6-supercomputer  Sri Lanka     1994      9882
8684  123456785  Gen Power  99989  s2-supercomputer  Kenya         2013      9805

Permutation and random sampling
dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)
df
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
1 8 9 10 11 12 13 14 15
2 16 17 18 19 20 21 22 23
3 24 25 26 27 28 29 30 31
4 32 33 34 35 36 37 38 39
5 40 41 42 43 44 45 46 47
6 48 49 50 51 52 53 54 55
7 56 57 58 59 60 61 62 63
8 64 65 66 67 68 69 70 71
9 72 73 74 75 76 77 78 79
sampler = np.random.permutation(10)
sampler
array([1, 5, 3, 6, 2, 4, 9, 0, 7, 8])
df.take(sampler)
    0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
5  40  41  42  43  44  45  46  47
3  24  25  26  27  28  29  30  31
6  48  49  50  51  52  53  54  55
2  16  17  18  19  20  21  22  23
4  32  33  34  35  36  37  38  39
9  72  73  74  75  76  77  78  79
0   0   1   2   3   4   5   6   7
7  56  57  58  59  60  61  62  63
8  64  65  66  67  68  69  70  71

# Random sample without replacement
df.take(np.random.permutation(len(df))[:3])

    0   1   2   3   4   5   6   7
9  72  73  74  75  76  77  78  79
2  16  17  18  19  20  21  22  23
0   0   1   2   3   4   5   6   7
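As a side note, DataFrame.sample reaches the same results without building a permutation by hand; a minimal sketch:

df.sample(n=3)                 # random sample without replacement
df.sample(n=10, replace=True)  # random sample with replacement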
# Random sampling with replacement. The cell defining `sack` was lost in the export;
# the values below are hypothetical, only the printed sampler output is original.
sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size=10)
sampler

array([3, 3, 0, 4, 0, 0, 1, 2, 1, 4])

draw = sack.take(sampler)
draw
Dummy variables
# (creation cell reconstructed from the table that followed it)
df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'], 'votes': [6, 7, 8, 9, 10, 11]})
df

    gender  votes
0   female      6
1   female      7
2     male      8
3  unknown      9
4     male     10
5   female     11

pd.get_dummies(df['gender'])

   female  male  unknown
0       1     0        0
1       1     0        0
2       0     1        0
3       0     0        1
4       0     1        0
5       1     0        0

dummies = pd.get_dummies(df['gender'])
dummies

   female  male  unknown
0       1     0        0
1       1     0        0
2       0     1        0
3       0     0        1
4       0     1        0
5       1     0        0
with_dummy = df[['votes']].join(dummies)
with_dummy
   votes  female  male  unknown
0      6       1     0        0
1      7       1     0        0
2      8       0     1        0
3      9       0     0        1
4     10       0     1        0
5     11       1     0        0
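get_dummies also takes a prefix argument so the indicator columns remain recognizable after a join; a minimal sketch on the same frame:

df[['votes']].join(pd.get_dummies(df['gender'], prefix='gender'))   # columns gender_female, gender_male, gender_unknown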