
AD3301 Chapter 4: Data Transformation (Colab notebook)

import pandas as pd
import numpy as np

Combining dataframes

dataFrame1 = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                           'Score': [89, 39, 50, 97, 22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
dataFrame2 = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                           'Score': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

# In the dataset above, the first column contains the student identifier and the
# second column contains the score. Suppose we want to combine the two
# dataframes into one. We can do that by using the pandas concat() method.

dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)


dataframe


    StudentID  Score
0           1     89
1           3     39
2           5     50
3           7     97
4           9     22
5          11     66
6          13     31
7          15     51
8          17     71
9          19     91
10         21     56
11         23     32
12         25     52
13         27     73
14         29     92
15          2     98
16          4     93
17          6     44
18          8     77
19         10     69
20         12     56
21         14     31
22         16     53
23         18     78
24         20     93
25         22     56
26         24     77
27         26     33
28         28     56
29         30     27

The ignore_index argument creates a new index; in its absence, the original indices are kept. Note that we combined the dataframes along axis=0, that is to say, one below the other. What if we want to combine them side by side? Then we have to specify axis=1. Check the output and see the difference.

pd.concat([dataFrame1, dataFrame2], axis=1)

    StudentID  Score  StudentID  Score
0           1     89          2     98
1           3     39          4     93
2           5     50          6     44
3           7     97          8     77
4           9     22         10     69
5          11     66         12     56
6          13     31         14     31
7          15     51         16     53
8          17     71         18     78
9          19     91         20     93
10         21     56         22     56
11         23     32         24     77
12         25     52         26     33
13         27     73         28     56
14         29     92         30     27
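If you also want to track which source frame each row came from, concat accepts a keys argument that builds a hierarchical index. A minimal sketch (the 'section1'/'section2' labels are ours, not from the original notebook):

# Label each source frame; the result carries a (key, row) MultiIndex.
labeled = pd.concat([dataFrame1, dataFrame2], keys=['section1', 'section2'])
labeled.loc['section2'].head()  # select the rows that came from dataFrame2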


Merging
In the first example, you received two files for the same subject. Now consider the use case where you are teaching two courses. You will then get two dataframes from each section: two for the Software Engineering course and two for the Introduction to Machine Learning course. Check the dataframes given below:

df1SE = pd.DataFrame({'StudentID': [9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                      'ScoreSE': [22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
df2SE = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                      'ScoreSE': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

df1ML = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                      'ScoreML': [39, 49, 55, 77, 52, 86, 41, 77, 73, 51, 86, 82, 92, 23, 49]})
df2ML = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
                      'ScoreML': [93, 44, 78, 97, 87, 89, 39, 43, 88, 78]})

As you can see in the dataset above, you have two dataframes for each subject. So the first task is to concatenate the two sections of each subject into one dataframe. Secondly, these students have taken the Introduction to Machine Learning course as well, so we need to merge the scores into the same dataframe. There are several ways to do this. Let us explore some options.
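Before walking through the options, one quick way to see how the join types differ is merge's indicator flag, which adds a _merge column recording whether each row matched on the left side, the right side, or both. A minimal sketch using df1SE and df1ML from above (the preview name is ours):

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'.
preview = df1SE.merge(df1ML, on='StudentID', how='outer', indicator=True)
preview['_merge'].value_counts()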

# Option 1
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = pd.concat([dfML, dfSE], axis=1)


df


    StudentID  ScoreML  StudentID  ScoreSE
0         1.0     39.0          9       22
1         3.0     49.0         11       66
2         5.0     55.0         13       31
3         7.0     77.0         15       51
4         9.0     52.0         17       71
5        11.0     86.0         19       91
6        13.0     41.0         21       56
7        15.0     77.0         23       32
8        17.0     73.0         25       52
9        19.0     51.0         27       73
10       21.0     86.0         29       92
11       23.0     82.0          2       98
12       25.0     92.0          4       93
13       27.0     23.0          6       44
14       29.0     49.0          8       77
15        2.0     93.0         10       69
16        4.0     44.0         12       56
17        6.0     78.0         14       31
18        8.0     97.0         16       53
19       10.0     87.0         18       78
20       12.0     89.0         20       93
21       14.0     39.0         22       56
22       16.0     43.0         24       77
23       18.0     88.0         26       33
24       20.0     78.0         28       56
25        NaN      NaN         30       27

# Option 2
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='inner')
df

# Here, you perform an inner join between the two dataframes. That is to say,
# only the StudentIDs that exist in both dataframes appear in the output.


    StudentID  ScoreSE  ScoreML
0           9       22       52
1          11       66       86
2          13       31       41
3          15       51       77
4          17       71       73
5          19       91       51
6          21       56       86
7          23       32       82
8          25       52       92
9          27       73       23
10         29       92       49
11          2       98       93
12          4       93       44
13          6       44       78
14          8       77       97
15         10       69       87
16         12       56       89
17         14       31       39
18         16       53       43
19         18       78       88
20         20       93       78

# Option 3
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='left')
df


    StudentID  ScoreSE  ScoreML
0           9       22     52.0
1          11       66     86.0
2          13       31     41.0
3          15       51     77.0
4          17       71     73.0
5          19       91     51.0
6          21       56     86.0
7          23       32     82.0
8          25       52     92.0
9          27       73     23.0
10         29       92     49.0
11          2       98     93.0
12          4       93     44.0
13          6       44     78.0
14          8       77     97.0
15         10       69     87.0
16         12       56     89.0
17         14       31     39.0
18         16       53     43.0
19         18       78     88.0
20         20       93     78.0
21         22       56      NaN
22         24       77      NaN
23         26       33      NaN
24         28       56      NaN
25         30       27      NaN

# Option 4
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='right')
df


    StudentID  ScoreSE  ScoreML
0           9     22.0       52
1          11     66.0       86
2          13     31.0       41
3          15     51.0       77
4          17     71.0       73
5          19     91.0       51
6          21     56.0       86
7          23     32.0       82
8          25     52.0       92
9          27     73.0       23
10         29     92.0       49
11          2     98.0       93
12          4     93.0       44
13          6     44.0       78
14          8     77.0       97
15         10     69.0       87
16         12     56.0       89
17         14     31.0       39
18         16     53.0       43
19         18     78.0       88
20         20     93.0       78
21          1      NaN       39
22          3      NaN       49
23          5      NaN       55
24          7      NaN       77

# Option 5
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)

df = dfSE.merge(dfML, how='outer')
df


    StudentID  ScoreSE  ScoreML
0           9     22.0     52.0
1          11     66.0     86.0
2          13     31.0     41.0
3          15     51.0     77.0
4          17     71.0     73.0
5          19     91.0     51.0
6          21     56.0     86.0
7          23     32.0     82.0
8          25     52.0     92.0
9          27     73.0     23.0
10         29     92.0     49.0
11          2     98.0     93.0
12          4     93.0     44.0
13          6     44.0     78.0
14          8     77.0     97.0
15         10     69.0     87.0
16         12     56.0     89.0
17         14     31.0     39.0
18         16     53.0     43.0
19         18     78.0     88.0
20         20     93.0     78.0
21         22     56.0      NaN
22         24     77.0      NaN
23         26     33.0      NaN
24         28     56.0      NaN
25         30     27.0      NaN
26          1      NaN     39.0
27          3      NaN     49.0
28          5      NaN     55.0
29          7      NaN     77.0

df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-d
df.head(10)

     Account           Company  Order                SKU        Country  Year  Quantity
0  123456779         Kulas Inc  99985   s9-supercomputer          Aruba  1981      5148
1  123456784            GitHub  99986   s4-supercomputer         Brazil  2001      3262
2  123456782         Kulas Inc  99990  s10-supercomputer     Montserrat  1973      9119
3  123456783         My SQ Man  99999   s1-supercomputer    El Salvador  2015      3097
4  123456787         ABC Dogma  99996   s6-supercomputer         Poland  1970      3356
5  123456778  Super Sexy Dingo  99996   s9-supercomputer     Costa Rica  2004      2474
6  123456783         ABC Dogma  99981  s11-supercomputer          Spain  2006      4081
7  123456785         ABC Dogma  99998   s9-supercomputer        Belarus  2015      6576
8  123456778         Loolo INC  99997   s8-supercomputer      Mauritius  1999      2460
9  123456775         Kulas Inc  99997   s7-supercomputer  French Guiana  2004      1831
(the UnitPrice column is cut off at the page edge in the printout)

# Add a new column that is the total price, based on the quantity and the unit price

df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)


     Account           Company  Order                SKU        Country  Year  Quantity
0  123456779         Kulas Inc  99985   s9-supercomputer          Aruba  1981      5148
1  123456784            GitHub  99986   s4-supercomputer         Brazil  2001      3262
2  123456782         Kulas Inc  99990  s10-supercomputer     Montserrat  1973      9119
3  123456783         My SQ Man  99999   s1-supercomputer    El Salvador  2015      3097
4  123456787         ABC Dogma  99996   s6-supercomputer         Poland  1970      3356
5  123456778  Super Sexy Dingo  99996   s9-supercomputer     Costa Rica  2004      2474
6  123456783         ABC Dogma  99981  s11-supercomputer          Spain  2006      4081
7  123456785         ABC Dogma  99998   s9-supercomputer        Belarus  2015      6576
8  123456778         Loolo INC  99997   s8-supercomputer      Mauritius  1999      2460
9  123456775         Kulas Inc  99997   s7-supercomputer  French Guiana  2004      1831
(the UnitPrice and TotalPrice columns are cut off at the page edge in the printout)

df['Company'].value_counts()

My SQ Man                   869
Kirlosker Service Center    863
Will LLC                    862
ABC Dogma                   848
Kulas Inc                   840
Gen Power                   836
Name IT                     836
Super Sexy Dingo            828
GitHub                      823
Loolo INC                   822
SAS Web Tec                 798
Pryianka Ji                 775
Name: Company, dtype: int64

df.describe()

            Account         Order          Year      Quantity     UnitPrice    TotalPrice
count  1.000000e+04  10000.000000  10000.000000  10000.000000  10000.000000  1.000000e+04
mean   1.234568e+08  99989.562900   1994.619800   4985.447300    355.866600  1.773301e+06
std    5.741156e+00      5.905551     14.432771   2868.949686    201.378478  1.540646e+06
min    1.234568e+08  99980.000000   1970.000000      0.000000     10.000000  0.000000e+00
25%    1.234568e+08  99985.000000   1982.000000   2505.750000    181.000000  5.003370e+05
50%    1.234568e+08  99990.000000   1995.000000   4994.000000    356.000000  1.335698e+06
75%    1.234568e+08  99995.000000   2007.000000   7451.500000    531.000000  2.711653e+06
max    1.234568e+08  99999.000000   2019.000000   9999.000000    700.000000  6.841580e+06
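With TotalPrice in place, a natural follow-up is aggregating sales per company; a minimal sketch, assuming the df loaded above:

# Total sales per company, largest first.
df.groupby('Company')['TotalPrice'].sum().sort_values(ascending=False).head()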

Reshaping with Hierarchical Indexing

data = np.arange(15).reshape((3, 5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1

Bergen Oslo Trondheim Stavanger Kristiansand

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14


stacked = dframe1.stack()
stacked

Rainfall Bergen 0
Oslo 1
Trondheim 2
Stavanger 3
Kristiansand 4
Humidity Bergen 5
Oslo 6
Trondheim 7
Stavanger 8
Kristiansand 9
Wind Bergen 10
Oslo 11
Trondheim 12
Stavanger 13
Kristiansand 14
dtype: int64

stacked.unstack()

Bergen Oslo Trondheim Stavanger Kristiansand

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14

series1 = pd.Series([000, 111, 222, 333], index=['zeros','ones', 'twos', 'threes'])


series2 = pd.Series([444, 555, 666], index=['fours', 'fives', 'sixs'])

frame2 = pd.concat([series1, series2], keys=['Number1', 'Number2'])


frame2.unstack()

fives fours ones sixs threes twos zeros

Number1 NaN NaN 111.0 NaN 333.0 222.0 0.0

Number2 555.0 444.0 NaN 666.0 NaN NaN NaN
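By default unstack pivots the innermost index level into columns. Passing a level number lets you pivot the outer level (the keys) instead; a minimal sketch:

# Pivot the outer level: rows become the value labels, columns the keys.
frame2.unstack(0)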

Data deduplication

frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})

frame3


    column 1  column 2
0    Looping        10
1    Looping        10
2    Looping        22
3  Functions        23
4  Functions        23
5  Functions        24
6  Functions        24

frame3.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

frame4 = frame3.drop_duplicates()
frame4

column 1 column 2

0 Looping 10

2 Looping 22

3 Functions 23

5 Functions 24

frame3['column 3'] = range(7)


frame5 = frame3.drop_duplicates(['column 2'])
frame5

column 1 column 2 column 3

0 Looping 10 0

2 Looping 22 2

3 Functions 23 3

5 Functions 24 5
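drop_duplicates keeps the first occurrence by default; the keep parameter can retain the last occurrence instead. A minimal sketch (frame6 is our name):

# Keep the last duplicate of each 'column 2' value; note how the
# retained 'column 3' values differ from frame5 above.
frame6 = frame3.drop_duplicates(['column 2'], keep='last')
frame6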

Replacing values

import numpy as np


replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': list(range(9))})
replaceFrame.replace(to_replace=-786, value=np.nan)

column 1 column 2

0 200.0 0

1 3000.0 1

2 NaN 2

3 3000.0 3

4 234.0 4

5 444.0 5

6 NaN 6

7 332.0 7

8 3332.0 8

replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': list(range(9))})
replaceFrame.replace(to_replace=[-786, 0], value=[np.nan, 2])

column 1 column 2

0 200.0 2

1 3000.0 1

2 NaN 2

3 3000.0 3

4 234.0 4

5 444.0 5

6 NaN 6

7 332.0 7

8 3332.0 8
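The same replacement can be expressed as a mapping, which reads more clearly when several values need different substitutes; a minimal sketch:

# Dict form: each key is replaced by its value.
replaceFrame.replace({-786: np.nan, 0: 2})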

Handling missing data

data = np.arange(15, 30).reshape(5, 3)

dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx


store1 store2 store3

apple 15 16 17

banana 18 19 20

kiwi 21 22 23

grapes 24 25 26

mango 27 28 29

dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.0  # use .loc, not chained indexing, to set a single cell
dfx

store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 NaN NaN

kiwi 21.0 22.0 23.0 NaN NaN

grapes 24.0 25.0 26.0 NaN NaN

mango 27.0 28.0 29.0 NaN NaN

watermelon 15.0 16.0 17.0 18.0 NaN

oranges NaN NaN NaN NaN NaN

dfx.isnull()

store1 store2 store3 store4 store5

apple False False False False True

banana False False False True True

kiwi False False False True True

grapes False False False True True

mango False False False True True

watermelon False False False False True

oranges True True True True True

dfx.notnull()


            store1  store2  store3  store4  store5
apple         True    True    True    True   False
banana        True    True    True   False   False
kiwi          True    True    True   False   False
grapes        True    True    True   False   False
mango         True    True    True   False   False
watermelon    True    True    True    True   False
oranges      False   False   False   False   False

dfx.isnull().sum()

store1    1
store2    1
store3    1
store4    5
store5    7
dtype: int64

dfx.isnull().sum().sum()

15

dfx.count()

store1 6
store2 6
store3 6
store4 2
store5 0
dtype: int64

dfx.store4[dfx.store4.notnull()]

apple 20.0
watermelon 18.0
Name: store4, dtype: float64

dfx.store4.dropna()

apple 20.0
watermelon 18.0
Name: store4, dtype: float64

dfx.dropna()

   store1 store2 store3 store4 store5
(an empty dataframe: every row contains at least one NaN, so dropping any row with a NaN removes them all)

dfx.dropna(how='all')


store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 NaN NaN

kiwi 21.0 22.0 23.0 NaN NaN

grapes 24.0 25.0 26.0 NaN NaN

mango 27.0 28.0 29.0 NaN NaN

watermelon 15.0 16.0 17.0 18.0 NaN

dfx.dropna(how='all', axis=1)

store1 store2 store3 store4

apple 15.0 16.0 17.0 20.0

banana 18.0 19.0 20.0 NaN

kiwi 21.0 22.0 23.0 NaN

grapes 24.0 25.0 26.0 NaN

mango 27.0 28.0 29.0 NaN

watermelon 15.0 16.0 17.0 18.0

oranges NaN NaN NaN NaN

dfx2 = dfx.copy()
dfx2.loc['oranges', 'store1'] = 0
dfx2.loc['oranges', 'store3'] = 0
dfx2

store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 NaN NaN

kiwi 21.0 22.0 23.0 NaN NaN

grapes 24.0 25.0 26.0 NaN NaN

mango 27.0 28.0 29.0 NaN NaN

watermelon 15.0 16.0 17.0 18.0 NaN

oranges 0.0 NaN 0.0 NaN NaN

dfx2.dropna(how='any', axis=1)


            store1  store3
apple         15.0    17.0
banana        18.0    20.0
kiwi          21.0    23.0
grapes        24.0    26.0
mango         27.0    29.0
watermelon    15.0    17.0
oranges        0.0     0.0

dfx.dropna(thresh=5, axis=1)

            store1  store2  store3
apple         15.0    16.0    17.0
banana        18.0    19.0    20.0
kiwi          21.0    22.0    23.0
grapes        24.0    25.0    26.0
mango         27.0    28.0    29.0
watermelon    15.0    16.0    17.0
oranges        NaN     NaN     NaN

NaN values in mathematical operations

ar1 = np.array([100, 200, np.nan, 300])


ser1 = pd.Series(ar1)

ar1.mean(), ser1.mean()

(nan, 200.0)

ser2 = dfx.store4
ser2.sum()

38.0

ser2.mean()

19.0

ser2.cumsum()

apple 20.0
banana NaN
kiwi NaN


grapes NaN
mango NaN
watermelon 38.0
oranges NaN
Name: store4, dtype: float64

dfx.store4 + 1

apple 21.0
banana NaN
kiwi NaN
grapes NaN
mango NaN
watermelon 19.0
oranges NaN
Name: store4, dtype: float64
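As the examples show, pandas statistics skip NaN by default, while NumPy propagates it. If you want the NumPy behaviour from pandas, skipna=False restores it; a minimal sketch:

# With skipna=False the sum is NaN because store4 contains missing values.
ser2.sum(), ser2.sum(skipna=False)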

Filling in missing data

filledDf = dfx.fillna(0)
filledDf

store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 0.0

banana 18.0 19.0 20.0 0.0 0.0

kiwi 21.0 22.0 23.0 0.0 0.0

grapes 24.0 25.0 26.0 0.0 0.0

mango 27.0 28.0 29.0 0.0 0.0

watermelon 15.0 16.0 17.0 18.0 0.0

oranges 0.0 0.0 0.0 0.0 0.0

dfx.mean()

store1 20.0
store2 21.0
store3 22.0
store4 19.0
store5 NaN
dtype: float64

filledDf.mean()

store1 17.142857
store2 18.000000
store3 18.857143
store4 5.428571


store5 0.000000
dtype: float64

Forward and backward filling of the missing values

dfx.store4.fillna(method='ffill')

apple 20.0
banana 20.0
kiwi 20.0
grapes 20.0
mango 20.0
watermelon 18.0
oranges 18.0
Name: store4, dtype: float64

dfx.store4.fillna(method='bfill')

apple 20.0
banana 18.0
kiwi 18.0
grapes 18.0
mango 18.0
watermelon 18.0
oranges NaN
Name: store4, dtype: float64
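Both filling directions accept a limit argument that caps how many consecutive NaNs get filled; a minimal sketch:

# Forward-fill at most one consecutive NaN; later gaps stay NaN.
dfx.store4.fillna(method='ffill', limit=1)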

Filling with index labels

to_fill = pd.Series([14, 23, 12], index=['apple', 'mango', 'oranges'])


to_fill

apple 14
mango 23
oranges 12
dtype: int64

dfx.store4.fillna(to_fill)

apple 20.0
banana NaN
kiwi NaN
grapes NaN
mango 23.0
watermelon 18.0
oranges 12.0
Name: store4, dtype: float64

dfx.fillna(dfx.mean())


store1 store2 store3 store4 store5

apple 15.0 16.0 17.0 20.0 NaN

banana 18.0 19.0 20.0 19.0 NaN

kiwi 21.0 22.0 23.0 19.0 NaN

grapes 24.0 25.0 26.0 19.0 NaN

mango 27.0 28.0 29.0 19.0 NaN

watermelon 15.0 16.0 17.0 18.0 NaN

oranges 20.0 21.0 22.0 19.0 NaN

Interpolation of missing values

ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])


ser3.interpolate()

0 100.0
1 148.0
2 196.0
3 244.0
4 292.0
dtype: float64

from datetime import datetime


ts = pd.Series([10, np.nan, np.nan, 9],
index=[datetime(2019, 1,1),
datetime(2019, 2,1),
datetime(2019, 3,1),
datetime(2019, 5,1)])

ts

2019-01-01 10.0
2019-02-01 NaN
2019-03-01 NaN
2019-05-01 9.0
dtype: float64

ts.interpolate()

2019-01-01 10.000000
2019-02-01 9.666667
2019-03-01 9.333333
2019-05-01 9.000000
dtype: float64

ts.interpolate(method='time')

2019-01-01 10.000000


2019-02-01 9.741667
2019-03-01 9.508333
2019-05-01 9.000000
dtype: float64
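interpolate is not limited to Series; called on a dataframe it fills each column independently. A minimal sketch on the store data from earlier:

# Linear interpolation down each column; store4's gap between
# apple (20.0) and watermelon (18.0) is filled in even steps.
dfx[['store1', 'store4']].interpolate()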

Renaming axis indexes

data = np.arange(15).reshape((3, 5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers, columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'])
dframe1

Bergen Oslo Trondheim Stavanger Kristiansand

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14

# Say you want to transform the index terms to capital letters.

dframe1.index = dframe1.index.map(str.upper)
dframe1

Bergen Oslo Trondheim Stavanger Kristiansand

RAINFALL 0 1 2 3 4

HUMIDITY 5 6 7 8 9

WIND 10 11 12 13 14

dframe1.rename(index=str.title, columns=str.upper)

BERGEN OSLO TRONDHEIM STAVANGER KRISTIANSAND

Rainfall 0 1 2 3 4

Humidity 5 6 7 8 9

Wind 10 11 12 13 14
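rename also accepts explicit mappings when only some labels should change; a minimal sketch (the new names are ours):

# Only the listed labels are renamed; everything else is untouched.
dframe1.rename(index={'RAINFALL': 'PRECIPITATION'}, columns={'Bergen': 'BGO'})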

Discretization and binning

import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

bins = [118, 125, 135, 160, 200]



category = pd.cut(height, bins)

category

[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160,
200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]

pd.value_counts(category)

(118, 125] 5
(135, 160] 3
(125, 135] 3
(160, 200] 1
dtype: int64

category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)

category2

[[118, 126), [118, 126), [118, 126), [126, 136), [118, 126), ..., [126, 136), [161,
200), [136, 161), [136, 161), [126, 136)]
Length: 12
Categories (4, interval[int64]): [[118, 126) < [126, 136) < [136, 161) < [161, 200)]

bin_names = ['Short Height', 'Average Height', 'Good Height', 'Taller']

pd.cut(height, bins, labels=bin_names)

[Short Height, Short Height, Short Height, Average Height, Short Height, ..., Average Height, Taller, Good Height, Good Height, Average Height]
Length: 12
Categories (4, object): [Short Height < Average Height < Good Height < Taller]
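If you need plain integer bin codes rather than Interval or label objects (for example, as a model feature), labels=False returns the bin index of each observation; a minimal sketch:

# Each height maps to the integer index of its bin (0 = shortest bin).
pd.cut(height, bins, labels=False)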

# Number of bins as integer


import numpy as np

pd.cut(np.random.rand(40), 5, precision=2)

[(0.21, 0.41], (0.21, 0.41], (0.79, 0.98], (0.02, 0.21], (0.79, 0.98], ..., (0.41, 0.6], (0.02, 0.21], (0.6, 0.79], (0.02, 0.21], (0.6, 0.79]]
Length: 40
Categories (5, interval[float64]): [(0.02, 0.21] < (0.21, 0.41] < (0.41, 0.6] < (0.6, 0.79] < (0.79, 0.98]]

randomNumbers = np.random.rand(2000)
category3 = pd.qcut(randomNumbers, 4) # cut into quartiles
category3

[(0.502, 0.758], (0.758, 1.0], (0.502, 0.758], (0.758, 1.0],


(-0.00013600000000000005, 0.239], ..., (0.239, 0.502], (0.239, 0.502], (0.502, 0.758], (0.758, 1.0]]
Length: 2000
Categories (4, interval[float64]): [(-0.00013600000000000005, 0.239] < (0.239, 0.502] < (0.502, 0.758] < (0.758, 1.0]]

pd.value_counts(category3)

(0.758, 1.0]                        500
(0.502, 0.758]                      500
(0.239, 0.502]                      500
(-0.00013600000000000005, 0.239]    500
dtype: int64

pd.qcut(randomNumbers, [0, 0.3, 0.5, 0.7, 1.0])

[(0.502, 0.709], (0.709, 1.0], (0.502, 0.709], (0.709, 1.0],
(-0.00013600000000000005, 0.291], ..., (0.291, 0.502], (0.291, 0.502], (0.502, 0.709], (0.709, 1.0]]
Length: 2000
Categories (4, interval[float64]): [(-0.00013600000000000005, 0.291] < (0.291, 0.502] < (0.502, 0.709] < (0.709, 1.0]]

df = pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-d
df.head(10)

Account Company Order SKU Country Year Quantity

0 123456779 Kulas Inc 99985 s9-supercomputer Aruba 1981 5148

1 123456784 GitHub 99986 s4-supercomputer Brazil 2001 3262

2 123456782 Kulas Inc 99990 s10-supercomputer Montserrat 1973 9119

3 123456783 My SQ Man 99999 s1-supercomputer El Salvador 2015 3097

4 123456787 ABC Dogma 99996 s6-supercomputer Poland 1970 3356

5 123456778 Super Sexy Dingo 99996 s9-supercomputer Costa Rica 2004 2474

6 123456783 ABC Dogma 99981 s11-supercomputer Spain 2006 4081

7 123456785 ABC Dogma 99998 s9-supercomputer Belarus 2015 6576

8 123456778 Loolo INC 99997 s8-supercomputer Mauritius 1999 2460

9 123456775 Kulas Inc 99997 s7-supercomputer French Guiana 2004 1831

df.describe()


            Account         Order          Year      Quantity     UnitPrice
count  1.000000e+04  10000.000000  10000.000000  10000.000000  10000.000000
mean   1.234568e+08  99989.562900   1994.619800   4985.447300    355.866600
std    5.741156e+00      5.905551     14.432771   2868.949686    201.378478
min    1.234568e+08  99980.000000   1970.000000      0.000000     10.000000
25%    1.234568e+08  99985.000000   1982.000000   2505.750000    181.000000
50%    1.234568e+08  99990.000000   1995.000000   4994.000000    356.000000
75%    1.234568e+08  99995.000000   2007.000000   7451.500000    531.000000
max    1.234568e+08  99999.000000   2019.000000   9999.000000    700.000000

# Find order values that exceeded a threshold; first recompute the TotalPrice column
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head(10)

     Account           Company  Order                SKU        Country  Year  Quantity
0  123456779         Kulas Inc  99985   s9-supercomputer          Aruba  1981      5148
1  123456784            GitHub  99986   s4-supercomputer         Brazil  2001      3262
2  123456782         Kulas Inc  99990  s10-supercomputer     Montserrat  1973      9119
3  123456783         My SQ Man  99999   s1-supercomputer    El Salvador  2015      3097
4  123456787         ABC Dogma  99996   s6-supercomputer         Poland  1970      3356
5  123456778  Super Sexy Dingo  99996   s9-supercomputer     Costa Rica  2004      2474
6  123456783         ABC Dogma  99981  s11-supercomputer          Spain  2006      4081
7  123456785         ABC Dogma  99998   s9-supercomputer        Belarus  2015      6576
8  123456778         Loolo INC  99997   s8-supercomputer      Mauritius  1999      2460
9  123456775         Kulas Inc  99997   s7-supercomputer  French Guiana  2004      1831

# Find transactions that exceeded 3,000,000


TotalTransaction = df["TotalPrice"]
TotalTransaction[np.abs(TotalTransaction) > 3000000]

2 3711433
7 3965328
13 4758900
15 5189372
17 3989325
...
9977 3475824
9984 5251134
9987 5670420
9991 5735513
9996 3018490
Name: TotalPrice, Length: 2094, dtype: int64
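Having found the extreme transactions, a common follow-up is to cap them at a threshold rather than just inspect them; a minimal sketch (the cap value is illustrative):

# Winsorize the column: values above 3,000,000 are clipped to the cap.
df['TotalPrice'].clip(upper=3_000_000)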

df[np.abs(TotalTransaction) > 6741112]


        Account    Company  Order                SKU       Country  Year  Quantity
818   123456781  Gen Power  99991   s1-supercomputer  Burkina Faso  1985      9693
1402  123456778   Will LLC  99985  s11-supercomputer       Austria  1990      9844
2242  123456770    Name IT  99997   s9-supercomputer       Myanmar  1979      9804
2876  123456772  Gen Power  99992  s10-supercomputer          Mali  2007      9935
3210  123456782  Loolo INC  99991   s8-supercomputer        Kuwait  2006      9886
3629  123456779  My SQ Man  99980   s3-supercomputer     Hong Kong  1994      9694
7674  123456781  Loolo INC  99989   s6-supercomputer     Sri Lanka  1994      9882
8645  123456789  Gen Power  99996  s11-supercomputer      Suriname  2005      9742
8684  123456785  Gen Power  99989   s2-supercomputer         Kenya  2013      9805
(the UnitPrice and TotalPrice columns are cut off at the page edge in the printout)

Permutation and random sampling

dat = np.arange(80).reshape(10,8)
df = pd.DataFrame(dat)

df

0 1 2 3 4 5 6 7

0 0 1 2 3 4 5 6 7

1 8 9 10 11 12 13 14 15

2 16 17 18 19 20 21 22 23

3 24 25 26 27 28 29 30 31

4 32 33 34 35 36 37 38 39

5 40 41 42 43 44 45 46 47

6 48 49 50 51 52 53 54 55

7 56 57 58 59 60 61 62 63

8 64 65 66 67 68 69 70 71

9 72 73 74 75 76 77 78 79

sampler = np.random.permutation(10)
sampler

array([1, 5, 3, 6, 2, 4, 9, 0, 7, 8])

df.take(sampler)


    0   1   2   3   4   5   6   7
1   8   9  10  11  12  13  14  15
5  40  41  42  43  44  45  46  47
3  24  25  26  27  28  29  30  31
6  48  49  50  51  52  53  54  55
2  16  17  18  19  20  21  22  23
4  32  33  34  35  36  37  38  39
9  72  73  74  75  76  77  78  79
0   0   1   2   3   4   5   6   7
7  56  57  58  59  60  61  62  63
8  64  65  66  67  68  69  70  71

# Random sample without replacement

df.take(np.random.permutation(len(df))[:3])

    0   1   2   3   4   5   6   7
9  72  73  74  75  76  77  78  79
2  16  17  18  19  20  21  22  23
0   0   1   2   3   4   5   6   7

# Random sample with replacement


sack = np.array([4, 8, -2, 7, 5])
sampler = np.random.randint(0, len(sack), size = 10)
sampler

array([3, 3, 0, 4, 0, 0, 1, 2, 1, 4])

draw = sack.take(sampler)
draw

array([ 7, 7, 4, 5, 4, 4, 8, -2, 8, 5])
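pandas also exposes this pattern directly through DataFrame.sample, which draws rows without replacement by default; a minimal sketch:

# Three distinct rows; pass replace=True to allow repeats.
df.sample(n=3)
df.sample(n=10, replace=True)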

Dummy variables

df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
                   'votes': [6, 7, 8, 9, 10, 11]})

df


    gender  votes
0   female      6
1   female      7
2     male      8
3  unknown      9
4     male     10
5   female     11

pd.get_dummies(df['gender'])

   female  male  unknown
0       1     0        0
1       1     0        0
2       0     1        0
3       0     0        1
4       0     1        0
5       1     0        0

dummies = pd.get_dummies(df['gender'], prefix='gender')


dummies

gender_female gender_male gender_unknown

0 1 0 0

1 1 0 0

2 0 1 0

3 0 0 1

4 0 1 0

5 1 0 0

with_dummy = df[['votes']].join(dummies)
with_dummy

votes gender_female gender_male gender_unknown

0 6 1 0 0

1 7 1 0 0

2 8 0 1 0

3 9 0 0 1

4 10 0 1 0

5 11 1 0 0
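For modelling, one dummy column is redundant (it is implied by the others); drop_first=True drops it. A minimal sketch:

# gender_female is dropped; a row of all zeros now means 'female'.
pd.get_dummies(df['gender'], prefix='gender', drop_first=True)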
