
AD3301 Chapter 4: Data Transformation (Colab notebook)

import pandas as pd
import numpy as np

Combining dataframes

dataFrame1 = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                           'Score': [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                           'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

# In the dataset above, the first column contains the student identifier and the
# second column contains the score. Suppose we want to combine the two
# dataframes into one. We can do that by using the pandas concat() method.

dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)


dataframe


    StudentID  Score
0           1     89
1           3     39
2           5     50
3           7     97
4           9     22
5          11     66
6          13     31
7          15     51
8          17     71
9          19     91
10         21     56
11         23     32
12         25     52
13         27     73
14         29     92
15          2     98
16          4     93
17          6     44
18          8     77
19         10     69
20         12     56
21         14     31
22         16     53
23         18     78
24         20     93
25         22     56
26         24     77
27         26     33
28         28     56
29         30     27

The ignore_index argument creates a new index; in its absence, the original indices are kept. Note that we combined the dataframes along axis=0, that is to say, one below the other. What if we want to combine them side by side? Then we have to specify axis=1. Check the output and see the difference.

pd.concat([dataFrame1, dataFrame2], axis=1)

    StudentID  Score  StudentID  Score
0           1     89          2     98
1           3     39          4     93
2           5     50          6     44
3           7     97          8     77
4           9     22         10     69
5          11     66         12     56
6          13     31         14     31
7          15     51         16     53
8          17     71         18     78
9          19     91         20     93
10         21     56         22     56
11         23     32         24     77
12         25     52         26     33
13         27     73         28     56
14         29     92         30     27
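If you also want to track which source frame each row came from, concat accepts a keys argument that builds a hierarchical index. A minimal sketch (the 'section1'/'section2' labels are ours, not from the original notebook):

# Label each source frame; the result carries a (key, row) MultiIndex.
labeled = pd.concat([dataFrame1, dataFrame2], keys=['section1', 'section2'])
labeled.loc['section2'].head()  # select the rows that came from dataFrame2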


Merging
In the first example, you received two files for the same subject. Now consider the use case where you are teaching two courses. You will then get two dataframes from each section: two for the Software Engineering course and two for the Introduction to Machine Learning course. Check the dataframes given below:

df1SE = pd.DataFrame({'StudentID': [9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                      'ScoreSE': [22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
df2SE = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                      'ScoreSE': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

df1ML = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                      'ScoreML': [39, 49, 55, 77, 52, 86, 41, 77, 73, 51, 86, 82, 92, 23, 49]})
df2ML = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
                      'ScoreML': [93, 44, 78, 97, 87, 89, 39, 43, 88, 78]})

As you can see in the dataset above, you have two dataframes for each subject. So the first task is to concatenate the two sections of each subject into one dataframe. Secondly, these students have taken the Introduction to Machine Learning course as well, so we need to merge the scores into the same dataframe. There are several ways to do this. Let us explore some options.
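Before walking through the options, one quick way to see how the join types differ is merge's indicator flag, which adds a _merge column recording whether each row matched on the left side, the right side, or both. A minimal sketch using df1SE and df1ML from above (the preview name is ours):

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'.
preview = df1SE.merge(df1ML, on='StudentID', how='outer', indicator=True)
preview['_merge'].value_counts()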

# Option 1
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = pd.concat([dfML, dfSE], axis=1)


df


    StudentID  ScoreML  StudentID  ScoreSE
0         1.0     39.0          9       22
1         3.0     49.0         11       66
2         5.0     55.0         13       31
3         7.0     77.0         15       51
4         9.0     52.0         17       71
5        11.0     86.0         19       91
6        13.0     41.0         21       56
7        15.0     77.0         23       32
8        17.0     73.0         25       52
9        19.0     51.0         27       73
10       21.0     86.0         29       92
11       23.0     82.0          2       98
12       25.0     92.0          4       93
13       27.0     23.0          6       44
14       29.0     49.0          8       77
15        2.0     93.0         10       69
16        4.0     44.0         12       56
17        6.0     78.0         14       31
18        8.0     97.0         16       53
19       10.0     87.0         18       78
20       12.0     89.0         20       93
21       14.0     39.0         22       56
22       16.0     43.0         24       77
23       18.0     88.0         26       33
24       20.0     78.0         28       56
25        NaN      NaN         30       27

# Option 2
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='inner')
df

# Here, you perform an inner join between the two dataframes. That is to say,
# only the StudentIDs that exist in both dataframes appear in the output.


    StudentID  ScoreSE  ScoreML
0           9       22       52
1          11       66       86
2          13       31       41
3          15       51       77
4          17       71       73
5          19       91       51
6          21       56       86
7          23       32       82
8          25       52       92
9          27       73       23
10         29       92       49
11          2       98       93
12          4       93       44
13          6       44       78
14          8       77       97
15         10       69       87
16         12       56       89
17         14       31       39
18         16       53       43
19         18       78       88
20         20       93       78

# Option 3
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='left')
df


    StudentID  ScoreSE  ScoreML
0           9       22     52.0
1          11       66     86.0
2          13       31     41.0
3          15       51     77.0
4          17       71     73.0
5          19       91     51.0
6          21       56     86.0
7          23       32     82.0
8          25       52     92.0
9          27       73     23.0
10         29       92     49.0
11          2       98     93.0
12          4       93     44.0
13          6       44     78.0
14          8       77     97.0
15         10       69     87.0
16         12       56     89.0
17         14       31     39.0
18         16       53     43.0
19         18       78     88.0
20         20       93     78.0
21         22       56      NaN
22         24       77      NaN
23         26       33      NaN
24         28       56      NaN
25         30       27      NaN

# Option 4
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='right')
df


    StudentID  ScoreSE  ScoreML
0           9     22.0       52
1          11     66.0       86
2          13     31.0       41
3          15     51.0       77
4          17     71.0       73
5          19     91.0       51
6          21     56.0       86
7          23     32.0       82
8          25     52.0       92
9          27     73.0       23
10         29     92.0       49
11          2     98.0       93
12          4     93.0       44
13          6     44.0       78
14          8     77.0       97
15         10     69.0       87
16         12     56.0       89
17         14     31.0       39
18         16     53.0       43
19         18     78.0       88
20         20     93.0       78
21          1      NaN       39
22          3      NaN       49
23          5      NaN       55
24          7      NaN       77

# Option 5
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='outer')
df


    StudentID  ScoreSE  ScoreML
0           9     22.0     52.0
1          11     66.0     86.0
2          13     31.0     41.0
3          15     51.0     77.0
4          17     71.0     73.0
5          19     91.0     51.0
6          21     56.0     86.0
7          23     32.0     82.0
8          25     52.0     92.0
9          27     73.0     23.0
10         29     92.0     49.0
11          2     98.0     93.0
12          4     93.0     44.0
13          6     44.0     78.0
14          8     77.0     97.0
15         10     69.0     87.0
16         12     56.0     89.0
17         14     31.0     39.0
18         16     53.0     43.0
19         18     78.0     88.0
20         20     93.0     78.0
21         22     56.0      NaN
22         24     77.0      NaN
23         26     33.0      NaN
24         28     56.0      NaN
25         30     27.0      NaN
26          1      NaN     39.0
27          3      NaN     49.0
28          5      NaN     55.0
29          7      NaN     77.0

df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-d
df.head(10)

     Account           Company  Order                SKU        Country  Year  Quantity
0  123456779         Kulas Inc  99985   s9-supercomputer          Aruba  1981      5148
1  123456784            GitHub  99986   s4-supercomputer         Brazil  2001      3262
2  123456782         Kulas Inc  99990  s10-supercomputer     Montserrat  1973      9119
3  123456783         My SQ Man  99999   s1-supercomputer    El Salvador  2015      3097
4  123456787         ABC Dogma  99996   s6-supercomputer         Poland  1970      3356
5  123456778  Super Sexy Dingo  99996   s9-supercomputer     Costa Rica  2004      2474
6  123456783         ABC Dogma  99981  s11-supercomputer          Spain  2006      4081
7  123456785         ABC Dogma  99998   s9-supercomputer        Belarus  2015      6576
8  123456778         Loolo INC  99997   s8-supercomputer      Mauritius  1999      2460
9  123456775         Kulas Inc  99997   s7-supercomputer  French Guiana  2004      1831
(the UnitPrice column is cut off at the page edge in the printout)

# Add a new column that is the total price, based on the quantity and the unit price

df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)


     Account           Company  Order                SKU        Country  Year  Quantity
0  123456779         Kulas Inc  99985   s9-supercomputer          Aruba  1981      5148
1  123456784            GitHub  99986   s4-supercomputer         Brazil  2001      3262
2  123456782         Kulas Inc  99990  s10-supercomputer     Montserrat  1973      9119
3  123456783         My SQ Man  99999   s1-supercomputer    El Salvador  2015      3097
4  123456787         ABC Dogma  99996   s6-supercomputer         Poland  1970      3356
5  123456778  Super Sexy Dingo  99996   s9-supercomputer     Costa Rica  2004      2474
6  123456783         ABC Dogma  99981  s11-supercomputer          Spain  2006      4081
7  123456785         ABC Dogma  99998   s9-supercomputer        Belarus  2015      6576
8  123456778         Loolo INC  99997   s8-supercomputer      Mauritius  1999      2460
9  123456775         Kulas Inc  99997   s7-supercomputer  French Guiana  2004      1831
(the UnitPrice and TotalPrice columns are cut off at the page edge in the printout)

df['Company'].value_counts()

My SQ Man                   869
Kirlosker Service Center    863
Will LLC                    862
ABC Dogma                   848
Kulas Inc                   840
Gen Power                   836
Name IT                     836
Super Sexy Dingo            828
GitHub                      823
Loolo INC                   822
SAS Web Tec                 798
Pryianka Ji                 775
Name: Company, dtype: int64

df.describe()

            Account         Order          Year      Quantity     UnitPrice    TotalPrice
count  1.000000e+04  10000.000000  10000.000000  10000.000000  10000.000000  1.000000e+04
mean   1.234568e+08  99989.562900   1994.619800   4985.447300    355.866600  1.773301e+06
std    5.741156e+00      5.905551     14.432771   2868.949686    201.378478  1.540646e+06
min    1.234568e+08  99980.000000   1970.000000      0.000000     10.000000  0.000000e+00
25%    1.234568e+08  99985.000000   1982.000000   2505.750000    181.000000  5.003370e+05
50%    1.234568e+08  99990.000000   1995.000000   4994.000000    356.000000  1.335698e+06
75%    1.234568e+08  99995.000000   2007.000000   7451.500000    531.000000  2.711653e+06
max    1.234568e+08  99999.000000   2019.000000   9999.000000    700.000000  6.841580e+06
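With TotalPrice in place, a natural follow-up is aggregating sales per company; a minimal sketch, assuming the df loaded above:

# Total sales per company, largest first.
df.groupby('Company')['TotalPrice'].sum().sort_values(ascending=False).head()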

Reshaping with Hierarchical Indexing

data = np.arange(15).reshape((3, 5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1

Bergen Oslo Trondheim Stavanger Kristiansand

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14


stacked = dframe1.stack()
stacked

Rainfall Bergen 0
Oslo 1
Trondheim 2
Stavanger 3
Kristiansand 4
Humidity Bergen 5
Oslo 6
Trondheim 7
Stavanger 8
Kristiansand 9
Wind Bergen 10
Oslo 11
Trondheim 12
Stavanger 13
Kristiansand 14
dtype: int64

stacked.unstack()

Bergen Oslo Trondheim Stavanger Kristiansand

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14

series1 = pd.Series([000, 111, 222, 333], index=['zeros','ones', 'twos', 'threes'])


series2 = pd.Series([444, 555, 666], index=['fours', 'fives', 'sixs'])

frame2 = pd.concat([series1, series2], keys=['Number1', 'Number2'])


frame2.unstack()

fives fours ones sixs threes twos zeros

Number1 NaN NaN 111.0 NaN 333.0 222.0 0.0

Number2 555.0 444.0 NaN 666.0 NaN NaN NaN
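By default unstack pivots the innermost index level into columns. Passing a level number lets you pivot the outer level (the keys) instead; a minimal sketch:

# Pivot the outer level: rows become the value labels, columns the keys.
frame2.unstack(0)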

Data deduplication

frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})

frame3


    column 1  column 2
0    Looping        10
1    Looping        10
2    Looping        22
3  Functions        23
4  Functions        23
5  Functions        24
6  Functions        24

frame3.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

frame4 = frame3.drop_duplicates()
frame4

column 1 column 2

0 Looping 10

2 Looping 22

3 Functions 23

5 Functions 24

frame3['column 3'] = range(7)


frame5 = frame3.drop_duplicates(['column 2'])
frame5

column 1 column 2 column 3

0 Looping 10 0

2 Looping 22 2

3 Functions 23 3

5 Functions 24 5
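drop_duplicates keeps the first occurrence by default; the keep parameter can retain the last occurrence instead. A minimal sketch (frame6 is our name):

# Keep the last duplicate of each 'column 2' value; note how the
# retained 'column 3' values differ from frame5 above.
frame6 = frame3.drop_duplicates(['column 2'], keep='last')
frame6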

Replacing values

import numpy as np


replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': list(range(9))})
replaceFrame.replace(to_replace=-786, value=np.nan)

column 1 column 2

0 200.0 0

1 3000.0 1

2 NaN 2

3 3000.0 3

4 234.0 4

5 444.0 5

6 NaN 6

7 332.0 7

8 3332.0 8

replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': list(range(9))})
replaceFrame.replace(to_replace=[-786, 0], value=[np.nan, 2])

column 1 column 2

0 200.0 2

1 3000.0 1

2 NaN 2

3 3000.0 3

4 234.0 4

5 444.0 5

6 NaN 6

7 332.0 7

8 3332.0 8
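The same replacement can be expressed as a mapping, which reads more clearly when several values need different substitutes; a minimal sketch:

# Dict form: each key is replaced by its value.
replaceFrame.replace({-786: np.nan, 0: 2})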

Handling missing data

data = np.arange(15, 30).reshape(5, 3)

dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx


store1 store2 store3

apple 15 16 17

banana 18 19 20

kiwi 21 22 23

grapes 24 25 26

mango 27 28 29

dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.0  # use .loc, not chained indexing, to set a single cell
dfx

store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 NaN NaN

kiwi 21.0 22.0 23.0 NaN NaN

grapes 24.0 25.0 26.0 NaN NaN

mango 27.0 28.0 29.0 NaN NaN

watermelon 15.0 16.0 17.0 18.0 NaN

oranges NaN NaN NaN NaN NaN

dfx.isnull()

store1 store2 store3 store4 store5

apple False False False False True

banana False False False True True

kiwi False False False True True

grapes False False False True True

mango False False False True True

watermelon False False False False True

oranges True True True True True

dfx.notnull()


            store1  store2  store3  store4  store5
apple         True    True    True    True   False
banana        True    True    True   False   False
kiwi          True    True    True   False   False
grapes        True    True    True   False   False
mango         True    True    True   False   False
watermelon    True    True    True    True   False
oranges      False   False   False   False   False

dfx.isnull().sum()

store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64

dfx.isnull().sum().sum()

15

dfx.count()

store1 6
store2 6
store3 6
store4 2
store5 0
dtype: int64

dfx.store4[dfx.store4.notnull()]

apple 20.0
watermelon 18.0
Name: store4, dtype: float64

dfx.store4.dropna()

apple 20.0
watermelon 18.0
Name: store4, dtype: float64

dfx.dropna()

   store1 store2 store3 store4 store5
(an empty dataframe: every row contains at least one NaN, so dropping any row with a NaN removes them all)

dfx.dropna(how='all')


store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 NaN NaN

kiwi 21.0 22.0 23.0 NaN NaN

grapes 24.0 25.0 26.0 NaN NaN

mango 27.0 28.0 29.0 NaN NaN

watermelon 15.0 16.0 17.0 18.0 NaN

dfx.dropna(how='all', axis=1)

store1 store2 store3 store4

apple 15.0 16.0 17.0 20.0

banana 18.0 19.0 20.0 NaN

kiwi 21.0 22.0 23.0 NaN

grapes 24.0 25.0 26.0 NaN

mango 27.0 28.0 29.0 NaN

watermelon 15.0 16.0 17.0 18.0

oranges NaN NaN NaN NaN

dfx2 = dfx.copy()
dfx2.loc['oranges', 'store1'] = 0
dfx2.loc['oranges', 'store3'] = 0
dfx2

store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 NaN NaN

kiwi 21.0 22.0 23.0 NaN NaN

grapes 24.0 25.0 26.0 NaN NaN

mango 27.0 28.0 29.0 NaN NaN

watermelon 15.0 16.0 17.0 18.0 NaN

oranges 0.0 NaN 0.0 NaN NaN

dfx2.dropna(how='any', axis=1)


            store1  store3
apple         15.0    17.0
banana        18.0    20.0
kiwi          21.0    23.0
grapes        24.0    26.0
mango         27.0    29.0
watermelon    15.0    17.0
oranges        0.0     0.0

dfx.dropna(thresh=5, axis=1)

            store1  store2  store3
apple         15.0    16.0    17.0
banana        18.0    19.0    20.0
kiwi          21.0    22.0    23.0
grapes        24.0    25.0    26.0
mango         27.0    28.0    29.0
watermelon    15.0    16.0    17.0
oranges        NaN     NaN     NaN

NaN values in mathematical operations

ar1 = np.array([100, 200, np.nan, 300])


ser1 = pd.Series(ar1)

ar1.mean(), ser1.mean()

(nan, 200.0)

ser2 = dfx.store4
ser2.sum()

38.0

ser2.mean()

19.0

ser2.cumsum()

apple 20.0
banana NaN
kiwi NaN


grapes NaN
mango NaN
watermelon 38.0
oranges NaN
Name: store4, dtype: float64

dfx.store4 + 1

apple 21.0
banana NaN
kiwi NaN
grapes NaN
mango NaN
watermelon 19.0
oranges NaN
Name: store4, dtype: float64
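As the examples show, pandas statistics skip NaN by default, while NumPy propagates it. If you want the NumPy behaviour from pandas, skipna=False restores it; a minimal sketch:

# With skipna=False the sum is NaN because store4 contains missing values.
ser2.sum(), ser2.sum(skipna=False)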

Filling in missing data

filledDf = dfx.fillna(0)
filledDf

store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 0.0

banana 18.0 19.0 20.0 0.0 0.0

kiwi 21.0 22.0 23.0 0.0 0.0

grapes 24.0 25.0 26.0 0.0 0.0

mango 27.0 28.0 29.0 0.0 0.0

watermelon 15.0 16.0 17.0 18.0 0.0

oranges 0.0 0.0 0.0 0.0 0.0

dfx.mean()

store1 20.0
store2 21.0
store3 22.0
store4 19.0
store5 NaN
dtype: float64

filledDf.mean()

store1 17.142857
store2 18.000000
store3 18.857143
store4 5.428571


store5 0.000000
dtype: float64

Forward and backward filling of the missing values

dfx.store4.fillna(method='ffill')

apple 20.0
banana 20.0
kiwi 20.0
grapes 20.0
mango 20.0
watermelon 18.0
oranges 18.0
Name: store4, dtype: float64

dfx.store4.fillna(method='bfill')

apple 20.0
banana 18.0
kiwi 18.0
grapes 18.0
mango 18.0
watermelon 18.0
oranges NaN
Name: store4, dtype: float64
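Both filling directions accept a limit argument that caps how many consecutive NaNs get filled; a minimal sketch:

# Forward-fill at most one consecutive NaN; later gaps stay NaN.
dfx.store4.fillna(method='ffill', limit=1)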

Filling with index labels

to_fill = pd.Series([14, 23, 12], index=['apple', 'mango', 'oranges'])


to_fill

apple 14
mango 23
oranges 12
dtype: int64

dfx.store4.fillna(to_fill)

apple 20.0
banana NaN
kiwi NaN
grapes NaN
mango 23.0
watermelon 18.0
oranges 12.0
Name: store4, dtype: float64

dfx.fillna(dfx.mean())


store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 19.0 NaN

kiwi 21.0 22.0 23.0 19.0 NaN

grapes 24.0 25.0 26.0 19.0 NaN

mango 27.0 28.0 29.0 19.0 NaN

watermelon 15.0 16.0 17.0 18.0 NaN

oranges 20.0 21.0 22.0 19.0 NaN

Interpolation of missing values

ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])


ser3.interpolate()

0 100.0
1 148.0
2 196.0
3 244.0
4 292.0
dtype: float64

from datetime import datetime


ts = pd.Series([10, np.nan, np.nan, 9],
index=[datetime(2019, 1,1),
datetime(2019, 2,1),
datetime(2019, 3,1),
datetime(2019, 5,1)])

ts

2019-01-01 10.0
2019-02-01 NaN
2019-03-01 NaN
2019-05-01 9.0
dtype: float64

ts.interpolate()

2019-01-01 10.000000
2019-02-01 9.666667
2019-03-01 9.333333
2019-05-01 9.000000
dtype: float64

ts.interpolate(method='time')

2019-01-01 10.000000


2019-02-01 9.741667
2019-03-01 9.508333
2019-05-01 9.000000
dtype: float64
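interpolate is not limited to Series; called on a dataframe it fills each column independently. A minimal sketch on the store data from earlier:

# Linear interpolation down each column; store4's gap between
# apple (20.0) and watermelon (18.0) is filled in even steps.
dfx[['store1', 'store4']].interpolate()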

Renaming axis indexes

data = np.arange(15).reshape((3, 5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1

Bergen Oslo Trondheim Stavanger Kristiansand

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14

# Say you want to transform the index terms to capital letters.

dframe1.index = dframe1.index.map(str.upper)
dframe1

Bergen Oslo Trondheim Stavanger Kristiansand

RAINFALL 0 1 2 3 4

HUMIDITY 5 6 7 8 9

WIND 10 11 12 13 14

dframe1.rename(index=str.title, columns=str.upper)

BERGEN OSLO TRONDHEIM STAVANGER KRISTIANSAND

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14
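rename also accepts explicit mappings when only some labels should change; a minimal sketch (the new names are ours):

# Only the listed labels are renamed; everything else is untouched.
dframe1.rename(index={'RAINFALL': 'PRECIPITATION'}, columns={'Bergen': 'BGO'})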

Discretization and binning

import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

bins = [118, 125, 135, 160, 200]



category = pd.cut(height, bins)

category

[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160,
200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]

pd.value_counts(category)

(118, 125] 5
(135, 160] 3
(125, 135] 3
(160, 200] 1
dtype: int64

category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)

category2

[[118, 126), [118, 126), [118, 126), [126, 136), [118, 126), ..., [126, 136), [161,
200), [136, 161), [136, 161), [126, 136)]
Length: 12
Categories (4, interval[int64]): [[118, 126) < [126, 136) < [136, 161) < [161, 200)]

bin_names = ['Short Height', 'Average Height', 'Good Height', 'Taller']

pd.cut(height, bins, labels=bin_names)

[Short Height, Short Height, Short Height, Average Height, Short Height, ..., Average Height, Taller, Good Height, Good Height, Average Height]
Length: 12
Categories (4, object): [Short Height < Average Height < Good Height < Taller]
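If you need plain integer bin codes rather than Interval or label objects (for example, as a model feature), labels=False returns the bin index of each observation; a minimal sketch:

# Each height maps to the integer index of its bin (0 = shortest bin).
pd.cut(height, bins, labels=False)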

# Number of bins as integer


import numpy as np

pd.cut(np.random.rand(40), 5, precision=2)

[(0.21, 0.41], (0.21, 0.41], (0.79, 0.98], (0.02, 0.21], (0.79, 0.98], ..., (0.41, 0.6], (0.02, 0.21], (0.6, 0.79], (0.02, 0.21], (0.6, 0.79]]
Length: 40
Categories (5, interval[float64]): [(0.02, 0.21] < (0.21, 0.41] < (0.41, 0.6] < (0.6, 0.79] < (0.79, 0.98]]

randomNumbers = np.random.rand(2000)
category3 = pd.qcut(randomNumbers, 4) # cut into quartiles
category3

[(0.502, 0.758], (0.758, 1.0], (0.502, 0.758], (0.758, 1.0],


(-0.00013600000000000005, 0.239], ..., (0.239, 0.502], (0.239, 0.502], (0.502, 0.758], (0.758, 1.0]]
Length: 2000
Categories (4, interval[float64]): [(-0.00013600000000000005, 0.239] < (0.239, 0.502] < (0.502, 0.758] < (0.758, 1.0]]

pd.value_counts(category3)

(0.758, 1.0]                        500
(0.502, 0.758]                      500
(0.239, 0.502]                      500
(-0.00013600000000000005, 0.239]    500
dtype: int64

pd.qcut(randomNumbers, [0, 0.3, 0.5, 0.7, 1.0])

[(0.502, 0.709], (0.709, 1.0], (0.502, 0.709], (0.709, 1.0],
(-0.00013600000000000005, 0.291], ..., (0.291, 0.502], (0.291, 0.502], (0.502, 0.709], (0.709, 1.0]]
Length: 2000
Categories (4, interval[float64]): [(-0.00013600000000000005, 0.291] < (0.291, 0.502] < (0.502, 0.709] < (0.709, 1.0]]

df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-d
df.head(10)

Account Company Order SKU Country Year Quantity

0 123456779 Kulas Inc 99985 s9-supercomputer Aruba 1981 5148

1 123456784 GitHub 99986 s4-supercomputer Brazil 2001 3262

2 123456782 Kulas Inc 99990 s10-supercomputer Montserrat 1973 9119

3 123456783 My SQ Man 99999 s1-supercomputer El Salvador 2015 3097

4 123456787 ABC Dogma 99996 s6-supercomputer Poland 1970 3356

5 123456778 Super Sexy Dingo 99996 s9-supercomputer Costa Rica 2004 2474

6 123456783 ABC Dogma 99981 s11-supercomputer Spain 2006 4081

7 123456785 ABC Dogma 99998 s9-supercomputer Belarus 2015 6576

8 123456778 Loolo INC 99997 s8-supercomputer Mauritius 1999 2460

9 123456775 Kulas Inc 99997 s7-supercomputer French Guiana 2004 1831

df.describe()


            Account         Order          Year      Quantity     UnitPrice
count  1.000000e+04  10000.000000  10000.000000  10000.000000  10000.000000
mean   1.234568e+08  99989.562900   1994.619800   4985.447300    355.866600
std    5.741156e+00      5.905551     14.432771   2868.949686    201.378478
min    1.234568e+08  99980.000000   1970.000000      0.000000     10.000000
25%    1.234568e+08  99985.000000   1982.000000   2505.750000    181.000000
50%    1.234568e+08  99990.000000   1995.000000   4994.000000    356.000000
75%    1.234568e+08  99995.000000   2007.000000   7451.500000    531.000000
max    1.234568e+08  99999.000000   2019.000000   9999.000000    700.000000

# Find order values that exceeded a threshold; first recompute the TotalPrice column
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)

     Account           Company  Order                SKU        Country  Year  Quantity
0  123456779         Kulas Inc  99985   s9-supercomputer          Aruba  1981      5148
1  123456784            GitHub  99986   s4-supercomputer         Brazil  2001      3262
2  123456782         Kulas Inc  99990  s10-supercomputer     Montserrat  1973      9119
3  123456783         My SQ Man  99999   s1-supercomputer    El Salvador  2015      3097
4  123456787         ABC Dogma  99996   s6-supercomputer         Poland  1970      3356
5  123456778  Super Sexy Dingo  99996   s9-supercomputer     Costa Rica  2004      2474
6  123456783         ABC Dogma  99981  s11-supercomputer          Spain  2006      4081
7  123456785         ABC Dogma  99998   s9-supercomputer        Belarus  2015      6576
8  123456778         Loolo INC  99997   s8-supercomputer      Mauritius  1999      2460
9  123456775         Kulas Inc  99997   s7-supercomputer  French Guiana  2004      1831

# Find transactions that exceeded 3,000,000


TotalTransaction = df["TotalPrice"]
TotalTransaction[np.abs(TotalTransaction) > 3000000]

2 3711433
7 3965328
13 4758900
15 5189372
17 3989325
...
9977 3475824
9984 5251134
9987 5670420
9991 5735513
9996 3018490
Name: TotalPrice, Length: 2094, dtype: int64
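Having found the extreme transactions, a common follow-up is to cap them at a threshold rather than just inspect them; a minimal sketch (the cap value is illustrative):

# Winsorize the column: values above 3,000,000 are clipped to the cap.
df['TotalPrice'].clip(upper=3_000_000)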

df[np.abs(TotalTransaction) > 6741112]


        Account    Company  Order                SKU       Country  Year  Quantity
818   123456781  Gen Power  99991   s1-supercomputer  Burkina Faso  1985      9693
1402  123456778   Will LLC  99985  s11-supercomputer       Austria  1990      9844
2242  123456770    Name IT  99997   s9-supercomputer       Myanmar  1979      9804
2876  123456772  Gen Power  99992  s10-supercomputer          Mali  2007      9935
3210  123456782  Loolo INC  99991   s8-supercomputer        Kuwait  2006      9886
3629  123456779  My SQ Man  99980   s3-supercomputer     Hong Kong  1994      9694
7674  123456781  Loolo INC  99989   s6-supercomputer     Sri Lanka  1994      9882
8645  123456789  Gen Power  99996  s11-supercomputer      Suriname  2005      9742
8684  123456785  Gen Power  99989   s2-supercomputer         Kenya  2013      9805
(the UnitPrice and TotalPrice columns are cut off at the page edge in the printout)

Permutation and random sampling

dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)

df

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

1 8 9 10 11 12 13 14 15

2 16 17 18 19 20 21 22 23

3 24 25 26 27 28 29 30 31

4 32 33 34 35 36 37 38 39

5 40 41 42 43 44 45 46 47

6 48 49 50 51 52 53 54 55

7 56 57 58 59 60 61 62 63

8 64 65 66 67 68 69 70 71

9 72 73 74 75 76 77 78 79

sampler = np.random.permutation(10)
sampler

array([1, 5, 3, 6, 2, 4, 9, 0, 7, 8])

df.take(sampler)


    0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
5  40  41  42  43  44  45  46  47
3  24  25  26  27  28  29  30  31
6  48  49  50  51  52  53  54  55
2  16  17  18  19  20  21  22  23
4  32  33  34  35  36  37  38  39
9  72  73  74  75  76  77  78  79
0   0   1   2   3   4   5   6   7
7  56  57  58  59  60  61  62  63
8  64  65  66  67  68  69  70  71

# Random sample without replacement

df.take(np.random.permutation(len(df))[:3])

    0   1   2   3   4   5   6   7
9  72  73  74  75  76  77  78  79
2  16  17  18  19  20  21  22  23
0   0   1   2   3   4   5   6   7

# Random sample with replacement


sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size = 10)
sampler

array([3, 3, 0, 4, 0, 0, 1, 2, 1, 4])

draw = sack.take(sampler)
draw

array([ 7, 7, 4, 5, 4, 4, 8, -2, 8, 5])
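pandas also exposes this pattern directly through DataFrame.sample, which draws rows without replacement by default; a minimal sketch:

# Three distinct rows; pass replace=True to allow repeats.
df.sample(n=3)
df.sample(n=10, replace=True)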

Dummy variables

df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
                   'votes': [6, 7, 8, 9, 10, 11]})

df


    gender  votes
0   female      6
1   female      7
2     male      8
3  unknown      9
4     male     10
5   female     11

pd.get_dummies(df['gender'])

   female  male  unknown
0       1     0        0
1       1     0        0
2       0     1        0
3       0     0        1
4       0     1        0
5       1     0        0

dummies = pd.get_dummies(df['gender'], prefix='gender')


dummies

gender_female gender_male gender_unknown

0 1 0 0

1 1 0 0

2 0 1 0

3 0 0 1

4 0 1 0

5 1 0 0

with_dummy = df[['votes']].join(dummies)
with_dummy

votes gender_female gender_male gender_unknown

0 6 1 0 0

1 7 1 0 0

2 8 0 1 0

3 9 0 0 1

4 10 0 1 0

5 11 1 0 0
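For modelling, one dummy column is redundant (it is implied by the others); drop_first=True drops it. A minimal sketch:

# gender_female is dropped; a row of all zeros now means 'female'.
pd.get_dummies(df['gender'], prefix='gender', drop_first=True)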
