Unit3 - Cleaning - Preparing - Data - Jupyter Notebook

Clean data

04/05/2023, 12:58 Unit3-cleaning,preparing data - Jupyter Notebook

Cleaning and Preparing data ¶


In [24]:

import pandas as pd

# read_excel already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed.
df = pd.read_excel("C:\\Users\\Admin\\Desktop\\sree.xlsx")
df

Out[24]:

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0   NaN  35.0  37.0
3         Nag  35.0  23.0  32.0   NaN  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0   NaN
6       vijay   NaN   NaN  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0   NaN
10       pspk  37.0  35.0  23.0   NaN  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0
14   Lakshman  21.0  29.0  10.0  26.0  29.0

In [ ]:

## HANDLING MISSING VALUES ##

# Pandas treats None and NaN as essentially interchangeable for indicating missing values.
# To facilitate this convention, there are several useful functions for detecting, removing, and replacing them.
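The interchangeability of None and NaN can be checked on a small example. The sketch below uses hypothetical toy Series (they are not part of the notebook's data) and assumes only pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series mixing both missing-value markers.
s = pd.Series([1.0, None, np.nan, 4.0])

# In a float Series, None is converted to NaN on construction,
# so isnull() flags both positions.
print(s.isnull().tolist())      # [False, True, True, False]
print(int(s.notnull().sum()))   # 2

# An object column keeps None as-is, but isnull() still flags it.
names = pd.Series(["Chiru", None, "Nag"], dtype=object)
print(names.isnull().tolist())  # [False, True, False]
```

The same isnull() call therefore works regardless of which marker a column happens to contain.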

localhost:8888/notebooks/anaconda3/Python/Unit3-cleaning%2Cpreparing data.ipynb 1/10



In [3]:

df.isnull()

Out[3]:

     NAME    AGE     WT    DSP    COA     CD
0   False  False  False  False  False  False
1   False  False  False  False  False  False
2   False  False  False   True  False  False
3   False  False  False  False   True  False
4   False  False  False  False  False  False
5   False  False  False  False  False   True
6   False   True   True  False  False  False
7   False  False  False  False  False  False
8   False  False  False  False  False  False
9   False  False  False  False  False   True
10  False  False  False  False   True  False
11  False  False  False  False  False  False
12  False  False  False  False  False  False
13  False  False  False  False  False  False
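A full Boolean mask is hard to scan, so it is often summarised. The sketch below is one possible way to do that; the DataFrame is retyped by hand from the notebook's displayed output (rows 0 to 13) because sree.xlsx itself is not available, and the variable names missing_per_column and rows_with_missing are illustrative.

```python
import numpy as np
import pandas as pd

# Data retyped from the notebook's displayed output (an assumption that the
# printed values match the Excel sheet exactly).
df = pd.DataFrame({
    "NAME": ["Chiru", "Venky", "Balayya", "Nag", "Lakshman", "Suresh", "vijay",
             "prabhas", "bunny", "anushka", "pspk", "mahesh", "ntr", "ramcharan"],
    "AGE": [23, 24, 23, 35, 21, 20, np.nan, 30, 28, 25, 37, 34, 32, 31],
    "WT":  [30, 23, 34, 23, 29, 31, np.nan, 37, 37, 27, 35, 29, 23, 2],
    "DSP": [34, 23, np.nan, 32, 10, 31, 23, 29, 26, 24, 23, 17, 40, 26],
    "COA": [24, 5, 35, np.nan, 26, 23, 37, 34, 29, 22, np.nan, 22, 22, 40],
    "CD":  [40, 35, 37, 35, 29, np.nan, 25, 9, 37, np.nan, 21, 9, 23, 42],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Only the rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(list(rows_with_missing.index))  # [2, 3, 5, 6, 9, 10]
```

The six row labels listed here are exactly the rows that dropna() removes in the next cell.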

In [4]:

df.dropna()

Out[4]:

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0
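With default arguments, dropna() removes every row that has any missing value, which discards 6 of the 14 rows here. A less aggressive variant can be sketched with the subset, thresh, and axis parameters; the four rows below are retyped from the notebook output, and the result names are illustrative.

```python
import numpy as np
import pandas as pd

# Four rows retyped from the notebook output.
df = pd.DataFrame({
    "NAME": ["Balayya", "Nag", "vijay", "prabhas"],
    "AGE": [23, 35, np.nan, 30],
    "WT": [34, 23, np.nan, 37],
    "DSP": [np.nan, 32, 23, 29],
    "COA": [35, np.nan, 37, 34],
    "CD": [37, 35, 25, 9],
})

# Drop a row only if AGE is missing; NaN in other columns is tolerated.
only_age = df.dropna(subset=["AGE"])

# Keep rows that have at least 5 non-null values out of 6.
min_five = df.dropna(thresh=5)

# Drop columns (axis=1) that contain any missing value.
drop_cols = df.dropna(axis=1)

print(list(only_age["NAME"]))   # ['Balayya', 'Nag', 'prabhas']
print(list(min_five["NAME"]))   # ['Balayya', 'Nag', 'prabhas']
print(list(drop_cols.columns))  # ['NAME', 'CD']
```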


In [5]:

df

Out[5]:

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0   NaN  35.0  37.0
3         Nag  35.0  23.0  32.0   NaN  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0   NaN
6       vijay   NaN   NaN  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0   NaN
10       pspk  37.0  35.0  23.0   NaN  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0

In [7]:

df.dropna(inplace=True)

In [8]:

df

Out[8]:

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0
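The cells above show that dropna() leaves df unchanged unless inplace=True is passed, and that the surviving rows keep their original labels (0, 1, 4, 7, ...). A minimal sketch of both points follows, using a hypothetical four-row subset of the data; reset_index(drop=True) is one common way to renumber the rows and is not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Small subset retyped from the notebook output.
df = pd.DataFrame({
    "NAME": ["Chiru", "Balayya", "Lakshman", "vijay"],
    "AGE": [23, 23, 21, np.nan],
    "DSP": [34, np.nan, 10, 23],
})

# Reassignment form: returns a new DataFrame and leaves df untouched.
# reset_index(drop=True) renumbers the remaining rows from 0.
clean = df.dropna().reset_index(drop=True)
print(list(clean.index))  # [0, 1]
print(len(df))            # 4

# In-place form: mutates df and returns None, so its result should not be assigned.
result = df.dropna(inplace=True)
print(result)             # None
print(list(df.index))     # [0, 2]
```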


In [11]:

import pandas as pd

# Reload the sheet to restore the rows removed by the in-place dropna() above.
df = pd.read_excel("C:\\Users\\Admin\\Desktop\\sree.xlsx")
df

Out[11]:

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0   NaN  35.0  37.0
3         Nag  35.0  23.0  32.0   NaN  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0   NaN
6       vijay   NaN   NaN  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0   NaN
10       pspk  37.0  35.0  23.0   NaN  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0


In [12]:

df.fillna(0)

Out[12]:

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0   0.0  35.0  37.0
3         Nag  35.0  23.0  32.0   0.0  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0   0.0
6       vijay   0.0   0.0  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0   0.0
10       pspk  37.0  35.0  23.0   0.0  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0

In [16]:

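# Filling Columns with Different Values
# fillna accepts a dict mapping column names to fill values; columns not listed keep their NaN.

df = df.fillna({'DSP': 25, 'COA': 25})
print(df)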

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0  25.0  35.0  37.0
3         Nag  35.0  23.0  32.0  25.0  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0   NaN
6       vijay   NaN   NaN  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0   NaN
10       pspk  37.0  35.0  23.0  25.0  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0
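The dict form of fillna can also carry a genuinely different value for each column. The sketch below is illustrative only: the four rows are retyped from the notebook output, and the choice of fill values (the AGE median, 0 for DSP, 25 for COA) is an assumption made for the example, not a recommendation from the notebook.

```python
import numpy as np
import pandas as pd

# Four rows retyped from the notebook output.
df = pd.DataFrame({
    "NAME": ["Balayya", "Nag", "Suresh", "vijay"],
    "AGE": [23, 35, 20, np.nan],
    "DSP": [np.nan, 32, 31, 23],
    "COA": [35, np.nan, 23, 37],
    "CD": [37, 35, np.nan, 25],
})

# One fill value per column; CD is not listed, so its NaN is left alone.
fill_values = {"AGE": df["AGE"].median(), "DSP": 0, "COA": 25}
filled = df.fillna(fill_values)

print(filled.loc[3, "AGE"])          # 23.0
print(filled.loc[0, "DSP"])          # 0.0
print(filled.loc[1, "COA"])          # 25.0
print(filled["CD"].isnull().sum())   # 1
```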


In [18]:

# Imputing a Missing Value with mean


df['CD'] = df['CD'].fillna(df['CD'].mean())
print(df)

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0  25.0  35.0  37.0
3         Nag  35.0  23.0  32.0  25.0  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0  28.5
6       vijay   NaN   NaN  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0  28.5
10       pspk  37.0  35.0  23.0  25.0  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0
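The value 28.5 is the mean of the twelve non-missing CD entries. The same idea can be applied to every numeric column at once by passing the Series of column means to fillna. The sketch below assumes the NAME, AGE, and CD columns as retyped from the notebook output; the median line is included only as a possible alternative that is less sensitive to extreme values.

```python
import numpy as np
import pandas as pd

# NAME, AGE and CD columns retyped from the notebook output.
df = pd.DataFrame({
    "NAME": ["Chiru", "Venky", "Balayya", "Nag", "Lakshman", "Suresh", "vijay",
             "prabhas", "bunny", "anushka", "pspk", "mahesh", "ntr", "ramcharan"],
    "AGE": [23, 24, 23, 35, 21, 20, np.nan, 30, 28, 25, 37, 34, 32, 31],
    "CD":  [40, 35, 37, 35, 29, np.nan, 25, 9, 37, np.nan, 21, 9, 23, 42],
})

# Mean of each numeric column; NaN entries are skipped by default.
col_means = df.mean(numeric_only=True)
print(col_means["CD"])        # 28.5

# fillna with a Series fills each column with the value under its own name.
imputed = df.fillna(col_means)
print(imputed.loc[5, "CD"])   # 28.5

# Possible alternative: impute with the median instead of the mean.
print(df["CD"].median())      # 32.0
```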

In [21]:

import pandas as pd

# Reload the sheet; duplicated() flags every row that repeats an earlier row.
df = pd.read_excel("C:\\Users\\Admin\\Desktop\\sree.xlsx")
print(df.duplicated())

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
dtype: bool

In [22]:

# Counting Duplicate Records in a DataFrame


print(df.duplicated().sum())
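By default duplicated() marks only the second and later copies of a row, so the count is the number of extra rows rather than the number of rows involved. A small sketch of the keep and subset parameters follows, using a four-row subset retyped from the notebook output in which the Lakshman row appears twice.

```python
import pandas as pd

# Subset retyped from the notebook output; the Lakshman row is repeated.
df = pd.DataFrame({
    "NAME": ["Chiru", "Lakshman", "ntr", "Lakshman"],
    "AGE": [23.0, 21.0, 32.0, 21.0],
    "WT": [30.0, 29.0, 23.0, 29.0],
})

# Default keep="first": only the later copy is flagged.
print(int(df.duplicated().sum()))                 # 1

# keep=False flags every member of a duplicate group.
print(df.duplicated(keep=False).tolist())         # [False, True, False, True]

# subset restricts the comparison to the listed columns.
print(int(df.duplicated(subset=["NAME"]).sum()))  # 1
```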


In [23]:

# Dropping Duplicates with Default Arguments


df = df.drop_duplicates()
print(df)

         NAME   AGE    WT   DSP   COA    CD
0       Chiru  23.0  30.0  34.0  24.0  40.0
1       Venky  24.0  23.0  23.0   5.0  35.0
2     Balayya  23.0  34.0   NaN  35.0  37.0
3         Nag  35.0  23.0  32.0   NaN  35.0
4    Lakshman  21.0  29.0  10.0  26.0  29.0
5      Suresh  20.0  31.0  31.0  23.0   NaN
6       vijay   NaN   NaN  23.0  37.0  25.0
7     prabhas  30.0  37.0  29.0  34.0   9.0
8       bunny  28.0  37.0  26.0  29.0  37.0
9     anushka  25.0  27.0  24.0  22.0   NaN
10       pspk  37.0  35.0  23.0   NaN  21.0
11     mahesh  34.0  29.0  17.0  22.0   9.0
12        ntr  32.0  23.0  40.0  22.0  23.0
13  ramcharan  31.0   2.0  26.0  40.0  42.0
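With default arguments, drop_duplicates() keeps the first copy of each repeated row and preserves the original labels. The sketch below shows two optional variations, keep="last" and ignore_index=True, on a hypothetical four-row subset retyped from the notebook output; it assumes a pandas version that supports ignore_index in drop_duplicates (1.0 or later).

```python
import pandas as pd

# Subset retyped from the notebook output; the Lakshman row is repeated.
df = pd.DataFrame({
    "NAME": ["Chiru", "Lakshman", "ntr", "Lakshman"],
    "AGE": [23.0, 21.0, 32.0, 21.0],
    "WT": [30.0, 29.0, 23.0, 29.0],
})

# Default: keep the first copy, original labels preserved.
first_kept = df.drop_duplicates()
print(list(first_kept.index))   # [0, 1, 2]

# keep="last": keep the final copy instead.
last_kept = df.drop_duplicates(keep="last")
print(list(last_kept.index))    # [0, 2, 3]

# ignore_index=True renumbers the result from 0.
renumbered = df.drop_duplicates(keep="last", ignore_index=True)
print(list(renumbered.index))   # [0, 1, 2]
```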

In [ ]:

###Data Formatting###


In [9]:

import csv

import xlsxwriter

book = xlsxwriter.Workbook("dsp.xlsx")
Campus_Name = "Rkvalley"
Branch_Name = "CSE"
Section_Name = "C"

# format1 highlights labels and the CSV header row; format2 is used for values.
format1 = book.add_format({'bg_color': "orange", 'border': 1})
format2 = book.add_format({'bg_color': "purple", 'border': 1})
s = book.add_worksheet("dsp")

# write(row, col, value, format) uses zero-based row and column indices.
s.write(1, 0, "Campus name", format1)
s.write(1, 1, Campus_Name, format2)
s.write(2, 0, "Branch name", format1)
s.write(2, 1, Branch_Name, format2)
s.write(3, 0, "Section name", format1)
s.write(3, 1, Section_Name, format2)

# Copy the CSV into the sheet starting at row 5.
index = 5
with open("stup.csv", newline="") as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        # The first CSV row is treated as the header. The name cell_format
        # avoids shadowing the built-in format().
        cell_format = format1 if index == 5 else format2

        # Write the first six fields of each row.
        for col in range(6):
            s.write(index, col, row[col], cell_format)

        index += 1

book.close()


In [51]:

### BINNING ###

import pandas as pd

# read_excel already returns a DataFrame.
df = pd.read_excel("C:\\Users\\Admin\\Desktop\\udaya.xlsx")
df

Out[51]:

         NAME  AGE  WT  DSP  COA  CD  TOTAL
0       Chiru   23  30   34   24  40    128
1       Venky   24  23   23    5  35     86
2     Balayya   23  34   20   35  37    126
3         Nag   35  23   32   29  35    119
4    Lakshman   21  29   10   26  29     94
5      Suresh   20  31   31   23  28    113
6       vijay   27   2   23   37  25     87
7     prabhas   30  37   29   34   9    109
8       bunny   28  37   26   29  37    129
9     anushka   25  27   24   22  33    106
10       pspk   37  35   23   35  21    114
11     mahesh   34  29   17   22   9     77
12        ntr   32  23   40   22  23    108
13  ramcharan   31   2   26   40  42    110


In [52]:

# pd.cut uses right-closed intervals by default: (70, 90], (90, 110], (110, 150].
bins = [70, 90, 110, 150]
group_names = ['fail', 'average', 'good']
df['status'] = pd.cut(df["TOTAL"], bins, labels=group_names)
df

Out[52]:

         NAME  AGE  WT  DSP  COA  CD  TOTAL   status
0       Chiru   23  30   34   24  40    128     good
1       Venky   24  23   23    5  35     86     fail
2     Balayya   23  34   20   35  37    126     good
3         Nag   35  23   32   29  35    119     good
4    Lakshman   21  29   10   26  29     94  average
5      Suresh   20  31   31   23  28    113     good
6       vijay   27   2   23   37  25     87     fail
7     prabhas   30  37   29   34   9    109  average
8       bunny   28  37   26   29  37    129     good
9     anushka   25  27   24   22  33    106  average
10       pspk   37  35   23   35  21    114     good
11     mahesh   34  29   17   22   9     77     fail
12        ntr   32  23   40   22  23    108  average
13  ramcharan   31   2   26   40  42    110  average
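ramcharan's total of 110 lands in "average" because pd.cut closes each interval on the right by default. The sketch below explores how the edge handling changes with the right and include_lowest parameters. The totals 128, 86, 110, and 77 come from the notebook; 70 and 160 are hypothetical edge cases added for illustration.

```python
import pandas as pd

bins = [70, 90, 110, 150]
group_names = ["fail", "average", "good"]

# 128, 86, 110 and 77 appear in the notebook; 70 and 160 are hypothetical.
totals = pd.Series([128, 86, 110, 77, 70, 160])

# Default: intervals (70, 90], (90, 110], (110, 150].
# 70 and 160 fall outside every interval and become NaN.
default = pd.cut(totals, bins, labels=group_names)
print(default.tolist())

# right=False: intervals [70, 90), [90, 110), [110, 150), so 110 moves to "good".
left_closed = pd.cut(totals, bins, labels=group_names, right=False)
print(left_closed.tolist())

# include_lowest=True: the first interval also contains its left edge, 70.
with_lowest = pd.cut(totals, bins, labels=group_names, include_lowest=True)
print(with_lowest.tolist())
```

Values outside the outermost edges are never assigned a label, so it is worth checking for NaN in the binned column when the bin range may not cover all of the data.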

In [ ]:
