Pandas Notes

Pandas is a Python library used for working with and analyzing data. It allows loading, cleaning, and manipulating data from many formats. Pandas provides Series for 1-D data and DataFrame for 2-D data, and it allows operations on data like arithmetic, logical operations, and reading/writing to different file formats.


May 8, 2023

1 What is pandas?
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning,
exploring, and manipulating data, and it can read and write data in different formats like xlsx, csv,
txt, JSON, etc.
Series = a 1-D labeled array: pd.Series(data)
DataFrame = a 2-D labeled structure, much like a table: pd.DataFrame(data)
Panel = a 3-D container of data (removed in modern pandas; use a DataFrame with a MultiIndex instead).
#Series and DataFrame are case sensitive; do not write them in lowercase.
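A minimal sketch of the two constructors (the values here are arbitrary):

[ ]: import pandas as pd

s = pd.Series([10, 20, 30]) # 1-D labeled array
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}) # 2-D labeled table
print(s)
print(df)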

2 Why Use Pandas?


Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.

3 Series
[30]: import pandas as pd
var = pd.Series([1,3,45,678,90])
print(var)
print(type(var))

#getting a value using its index number


print(var[3]) #prints 678
print()

#relabelling the index (passing an existing Series reindexes by its current labels, hence the NaN output below)


print(pd.Series(var, index =['A','B','C','D','E']))
print()

#change the datatype to float


print(pd.Series(var,dtype='float'))

print()

#series with tuple and dictionary


var = pd.Series((1,3,45,678,90)) #tuple data
var2 = pd.Series({'A':[1,2,34,5,6],'B':[2,23,34,56,4]}) #dictionary data
print(var)
print(var2)

#creating a Series from a single scalar value


var =pd.Series(34)
print(var)

#creating a Series where the single value 12 is repeated across 5 index labels


var =pd.Series(12 , index =[1,2,3,4,5])
print(var)

0 1
1 3
2 45
3 678
4 90
dtype: int64
<class 'pandas.core.series.Series'>
678

A NaN
B NaN
C NaN
D NaN
E NaN
dtype: float64

0 1.0
1 3.0
2 45.0
3 678.0
4 90.0
dtype: float64

0 1
1 3
2 45
3 678
4 90
dtype: int64
A [1, 2, 34, 5, 6]
B [2, 23, 34, 56, 4]
dtype: object
0 34
dtype: int64
1 12
2 12
3 12
4 12
5 12
dtype: int64
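Why the relabelled output above is all NaN: passing an existing Series plus a new index reindexes by the Series' current labels (0-4), which do not match 'A'-'E'. A sketch of one way to attach new labels while keeping the values, using .values:

[ ]: var = pd.Series([1,3,45,678,90])
print(pd.Series(var.values, index=['A','B','C','D','E'])) # keeps the data, relabels the index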

4 DataFrame
[33]: var = pd.DataFrame([[1,3,45,678,90,3],[2,3,45,67,8,3]])
print(var)
print(type(var))

#dataframe using dictionary


var ={"A" :[1,23,4,5] ,"B":[1,2,34,5],"C":['SD','DF','HK','KJ']}
a = pd.DataFrame(var)
print(a)

0 1 2 3 4 5
0 1 3 45 678 90 3
1 2 3 45 67 8 3
<class 'pandas.core.frame.DataFrame'>
A B C
0 1 1 SD
1 23 2 DF
2 4 34 HK
3 5 5 KJ
If you ever pass a dictionary to DataFrame, make sure all the dict values have equal
length; otherwise it raises an error:
ValueError: All arrays must be of the same length.
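A short sketch of the failure and a common workaround (wrapping the ragged lists in pd.Series pads the shorter column with NaN):

[ ]: bad = {"A": [1, 2, 3], "B": [1, 2]}
# pd.DataFrame(bad)  # raises ValueError: All arrays must be of the same length
ok = pd.DataFrame({k: pd.Series(v) for k, v in bad.items()})
print(ok)  # column B gets NaN in the last row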
[669]: var ={"A" :[1,23,4,5] ,"B":[1,2,34,5],"C":['SD','DF','HK','KJ']}
a = pd.DataFrame(var, columns=["A","B"]) #select one or more columns
print(a)

#get all columns names


print(a.columns)

#getting a value using the index in a DataFrame
# syntax: df[column_name][index_number]
print(a["A"][3]) #prints 5

A B
0 1 1
1 23 2
2 4 34
3 5 5
Index(['A', 'B'], dtype='object')
5

5 Arithmetic operation in Pandas


[64]: var = {'A':[1,2,34,56,7],'B':[12,3,456,7,23]}
var = pd.DataFrame(var)
print(var)
print()

#add
var["add"] = var["A"]+var["B"]

#sub
var["subtract"] = var["A"]-var["B"]

#multiply
var["multiply"] = var["A"]*var["B"]

#division
var["division"] = var["A"]/var["B"]

#modulus
var["modulus"] = var["A"]%var["B"]
print(var)
print()

A B
0 1 12
1 2 3
2 34 456
3 56 7
4 7 23

A B add subtract multiply division modulus


0 1 12 13 -11 12 0.083333 1
1 2 3 5 -1 6 0.666667 2
2 34 456 490 -422 15504 0.074561 34
3 56 7 63 49 392 8.000000 0
4 7 23 30 -16 161 0.304348 7

6 Logical operation in Pandas


[78]: a = {'A':[1,2,34,56,7],'B':[12,3,456,7,23]}
var = pd.DataFrame(a)
print(var)
print()

# < , > , == , & (and)


#returns True where the condition holds, otherwise False

print(var["A"]>4)
print()
print(var["B"]<12)
print()
print(var["A"]==7)
print()
print((var["A"] >7) & (var["B"] >9))

A B
0 1 12
1 2 3
2 34 456
3 56 7
4 7 23

0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool

0 False
1 True
2 False
3 True
4 False
Name: B, dtype: bool

0 False
1 False
2 False
3 False
4 True
Name: A, dtype: bool

0 False
1 False
2 True
3 False
4 False
dtype: bool

7 Insert in pandas
[95]: df =pd.DataFrame({'A':[1,2,34,56,7],'B':[12,3,456,7,23]})
print(df)

# syntax : insert(index number, new column name, data)


#A is at index 0 and B at index 1, so this inserts the new column at position 1, after A
df.insert(1,"C",df['A']) # copying all data from column A
df

#creating a new column and get the limited data from the column A

df['NEW'] =df['A'][0:4] #slicing


df

A B
0 1 12
1 2 3
2 34 456
3 56 7
4 7 23

[95]: A C B NEW
0 1 1 12 1.0
1 2 2 3 2.0
2 34 34 456 34.0
3 56 56 7 56.0
4 7 7 23 NaN

8 Delete in pandas
#pop() - removes a column from the DataFrame and returns it as a Series

[417]: df =pd.DataFrame({'A':[1,2,34,56,7],'B':[12,3,456,7,23]})
print(df)
print()

# syntax :dataframe.pop(label)

#delete a particular column


df.pop('A')
df

A B
0 1 12
1 2 3
2 34 456
3 56 7
4 7 23

[417]: B
0 12
1 3
2 456
3 7
4 23

9 Read & write in csv


[108]: import pandas as pd

# write csv
var =pd.DataFrame({"A":[1,4,74,23,63], "B":[34,987,23,23,52]})
print(var)
var.to_csv("test.csv",index=False, header=["name","id "])
print()

#index=False : if you don't want the index written to the file

#header=["name","id "] : if you want to rename the columns in the file

A B
0 1 34
1 4 987
2 74 23
3 23 23
4 63 52

[122]: #Read csv


df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\test.csv")
print(df)

# Syntax : nrows=n : if you want only the first n rows of the dataset

df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\test.csv", nrows=2)

print(df)
print()

# Syntax : usecols=[column_name, ...] : if you want particular columns from the dataset

df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\test.csv", usecols=['name'])

print(df)
print()

# Syntax : skiprows=[index number, ...] : it will skip those rows from the dataset

df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\test.csv", skiprows=[2,3])

print(df)
print()

name id
0 1 34
1 4 987
2 74 23
3 23 23
4 63 52
name id
0 1 34
1 4 987

name
0 1
1 4
2 74
3 23
4 63

name id
0 1 34
1 23 23
2 63 52

JSON = Python Dictionary


JSON objects have the same format as Python dictionaries.
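A minimal sketch of reading and writing JSON (the file names here are just placeholders):

[ ]: # hypothetical file; JSON keys map to DataFrame columns
df = pd.read_json("data.json")
print(df.head())
df.to_json("out.json") # writing works the same way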

10 Read excel
[132]: df=pd.read_excel("C:\\Users\\sanram\\Videos\\Jupyter python practise\\book2.xlsx")

df

# by default both methods return 5 rows


df.head(3) #return the top 3 rows
print()
df.tail(4) #return the bottom 4 rows

[132]: 2022-01-01 00:00:00 Unnamed: 1 Unnamed: 2 \


325 artf07582961 2022-12-07 00:00:00 2022-12-06 00:00:00
326 artf07583723 2022-12-09 00:00:00 2022-12-09 00:00:00
327 artf07582479 2022-12-07 00:00:00 2022-12-06 00:00:00
328 artf07582966 2022-12-07 00:00:00 2022-12-07 00:00:00
329 artf07582967 2022-11-30 00:00:00 2022-11-30 00:00:00

Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 \


325 KBS V2023-01 8.3.38 UWV Average
326 IMF V2021-02.1 8.3.38 UWV Complex
327 PDI V22.00.06 8.3.38 UWV Average
328 WWO_BACKEND V22.4 8.3.38 UWV Highly Complex
329 DMO 20223.0.4 8.3.38 UWV NaN

Unnamed: 8 Unnamed: 9 … Unnamed: 13 Unnamed: 14 \


325 Suryawanshi, Rajendra Netherland … FRPR3ANA05PR 12
326 Jacobs, Rob Netherland … FRPR3ANA05PR 12
327 Suryawanshi, Rajendra Netherland … FRPR3ANA05PR 12

328 Saravanan, Vennila Netherland … FRPR3ANA05PR 12
329 Tapessur, Ray Netherland … FRPR3ANA05PR 12

Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 \


325 ID#316905. 5 .NET;#C#;#Visual Basic Recurring 2022
326 ID#319028 4 Mainfram / COBOL Recurring 2022
327 ID#316903 3 PL/SQL Recurring 2022
328 ID#315664 9 (None);#Mainfram / COBOL Recurring 2022
329 ID#315284 7 .NET;#C# Recurring 2022

Unnamed: 20 Unnamed: 21 Unnamed: 22


325 Sawant, Snehal QGaaS NL_UWV_KBS_5th Run(8.3.38)
326 Sawant, Snehal QGaaS NL_UWV_IMF_4th Run(8.3.38)
327 Sawant, Snehal QGaaS NL_UWV_ PDI_3rd Run (8.3.38)
328 A, Thrisha QGaaS NL_UWV_WWO_BACKEND_9th Run(8.3.38)
329 Sawant, Snehal QGaaS NL_UWV_DMO_7th Run (8.3.38)

[5 rows x 23 columns]

11 Pandas Functions
[133]: df =pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\pandas case study\\blackfriday.csv")

df

[133]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
3 1000001 P00085442 F 0-17 10 A
4 1000002 P00285442 M 55+ 16 C
… … … … … … …
550063 1006033 P00372445 M 51-55 13 B
550064 1006035 P00375436 F 26-35 1 C
550065 1006036 P00375436 F 26-35 15 B
550066 1006038 P00375436 F 55+ 1 C
550067 1006039 P00371644 F 46-50 0 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12
4 4+ 0 8
… … … …
550063 1 1 20
550064 3 0 20
550065 4+ 1 20
550066 2 0 20
550067 4+ 1 20

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969
… … … …
550063 NaN NaN 368
550064 NaN NaN 371
550065 NaN NaN 137
550066 NaN NaN 365
550067 NaN NaN 490

[550068 rows x 12 columns]

[135]: # if you want only the index

df.index

[135]: RangeIndex(start=0, stop=550068, step=1)

[137]: #get all columns names


df.columns

[137]: Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',


'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
'Product_Category_2', 'Product_Category_3', 'Purchase'],
dtype='object')

[139]: df.describe()
# this function gives you summary statistics for the numeric columns
# like count, mean, std, min, max and the quartiles

[139]: User_ID Occupation Marital_Status Product_Category_1 \


count 5.500680e+05 550068.000000 550068.000000 550068.000000
mean 1.003029e+06 8.076707 0.409653 5.404270
std 1.727592e+03 6.522660 0.491770 3.936211
min 1.000001e+06 0.000000 0.000000 1.000000
25% 1.001516e+06 2.000000 0.000000 1.000000
50% 1.003077e+06 7.000000 0.000000 5.000000
75% 1.004478e+06 14.000000 1.000000 8.000000
max 1.006040e+06 20.000000 1.000000 20.000000

Product_Category_2 Product_Category_3 Purchase
count 376430.000000 166821.000000 550068.000000
mean 9.842329 12.668243 9263.968713
std 5.086590 4.125338 5023.065394
min 2.000000 3.000000 12.000000
25% 5.000000 9.000000 5823.000000
50% 9.000000 14.000000 8047.000000
75% 15.000000 16.000000 12054.000000
max 18.000000 18.000000 23961.000000

[142]: # getting data from fixed range using slicing

df[9:14]

[142]: User_ID Product_ID Gender Age Occupation City_Category \


9 1000005 P00274942 M 26-35 20 A
10 1000005 P00251242 M 26-35 20 A
11 1000005 P00014542 M 26-35 20 A
12 1000005 P00031342 M 26-35 20 A
13 1000005 P00145042 M 26-35 20 A

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


9 1 1 8
10 1 1 5
11 1 1 8
12 1 1 8
13 1 1 1

Product_Category_2 Product_Category_3 Purchase


9 NaN NaN 7871
10 11.0 NaN 5254
11 NaN NaN 3957
12 NaN NaN 6073
13 2.0 5.0 15665

12 Rename column names


The rename() method allows you to change the row indexes, and the columns labels.
syntax : df.rename(columns={old_column_name : new_column_name})

[631]: new = df.rename(columns={"User_ID":"col"})


new.head()

[631]: col Product_ID Gender Age Occupation City_Category \


6 1000004 P00184942 M 46-50 7.0 B
13 1000005 P00145042 M 26-35 20.0 A
18 1000007 P00036842 M 36-45 1.0 B
19 1000008 P00249542 M 26-35 12.0 C
24 1000008 P00303442 M 26-35 12.0 C

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


6 2 1.0 1
13 1 1.0 1
18 1 1.0 1
19 4+ 1.0 1
24 4+ 1.0 1

Product_Category_2 Product_Category_3 Purchase


6 8.0 17.0 19215
13 2.0 5.0 15665
18 14.0 16.0 11788
19 5.0 15.0 19614
24 8.0 14.0 11927

13 Sort
We can use the sort_index() method to sort the object by labels.

[149]: #sort the dataset by index, or reverse it into descending order

#axis=0 means along the rows
#axis=1 means along the columns

#sort the data row-wise in ascending or descending order


df.sort_index(axis=0,ascending=False)

[149]: User_ID Product_ID Gender Age Occupation City_Category \


550067 1006039 P00371644 F 46-50 0 B
550066 1006038 P00375436 F 55+ 1 C
550065 1006036 P00375436 F 26-35 15 B
550064 1006035 P00375436 F 26-35 1 C
550063 1006033 P00372445 M 51-55 13 B
… … … … … … …
4 1000002 P00285442 M 55+ 16 C
3 1000001 P00085442 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
0 1000001 P00069042 F 0-17 10 A

Stay_In_Current_City_Years Marital_Status Product_Category_1 \
550067 4+ 1 20
550066 2 0 20
550065 4+ 1 20
550064 3 0 20
550063 1 1 20
… … … …
4 4+ 0 8
3 2 0 12
2 2 0 12
1 2 0 1
0 2 0 3

Product_Category_2 Product_Category_3 Purchase


550067 NaN NaN 490
550066 NaN NaN 365
550065 NaN NaN 137
550064 NaN NaN 371
550063 NaN NaN 368
… … … …
4 NaN NaN 7969
3 14.0 NaN 1057
2 NaN NaN 1422
1 6.0 14.0 15200
0 NaN NaN 8370

[550068 rows x 12 columns]

[558]: # Sort dataset column wise


df.sort_index(axis=1).head(4)

[558]: Age City_Category Gender Marital_Status Occupation \


1 0-17 A F 0 10
6 46-50 B M 1 7
13 26-35 A M 1 20
14 51-55 A F 0 9

Product_Category_1 Product_Category_2 Product_Category_3 Product_ID \


1 1 6.0 14.0 P00248942
6 1 8.0 17.0 P00184942
13 1 2.0 5.0 P00145042
14 5 8.0 14.0 P00231342

Purchase Stay_In_Current_City_Years User_ID


1 15200 2 1000001
6 19215 2 1000004
13 15665 1 1000005
14 5378 1 1000006

14 Sort by values
Pandas provides sort_values() method to sort by values.
It accepts a by argument which will use the column name of the
DataFrame with which the values are to be sorted.
[560]: # Lets try to sort dataset using Purchase column

df.sort_values(by=['Purchase']).head()

[560]: User_ID Product_ID Gender Age Occupation City_Category \


377309 1004048 P00041442 F 36-45 1 B
411541 1003391 P00041442 M 18-25 4 A
233619 1006025 P00041442 F 26-35 1 B
5466 1000889 P00041442 M 46-50 20 A
172340 1002660 P00087142 M 55+ 17 C

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


377309 1 0 13
411541 0 0 13
233619 1 0 13
5466 1 0 13
172340 0 0 13

Product_Category_2 Product_Category_3 Purchase


377309 14.0 16.0 185
411541 14.0 16.0 185
233619 14.0 16.0 186
5466 14.0 16.0 186
172340 14.0 16.0 187

[561]: # Sort by multiple columns

df.sort_values(by=['Age','Purchase']).head(20)

[561]: User_ID Product_ID Gender Age Occupation City_Category \


320841 1001421 P00173042 F 0-17 10 A
325430 1002060 P00041442 M 0-17 1 C
162101 1001088 P00173042 F 0-17 10 A
115012 1005757 P00173042 M 0-17 10 C
355134 1000737 P00173042 M 0-17 19 A
402758 1001928 P00173042 M 0-17 10 B
152417 1005555 P00283142 M 0-17 10 B
329565 1002806 P00173042 M 0-17 10 B
405177 1002288 P00173042 M 0-17 10 B
279302 1001084 P00164042 M 0-17 19 C
336891 1003843 P00003442 F 0-17 10 B
302525 1004541 P00003442 M 0-17 10 B
482756 1002288 P00030942 M 0-17 10 B
281904 1001434 P00003442 F 0-17 10 A
116604 1006006 P00053842 F 0-17 0 C
229649 1005420 P00187342 F 0-17 19 B
515747 1001434 P0096442 F 0-17 10 A
396449 1001054 P00053842 M 0-17 10 C
329590 1002810 P00187342 F 0-17 10 B
240346 1001096 P00173042 M 0-17 10 C

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


320841 1 0 13
325430 3 0 13
162101 3 0 13
115012 2 0 13
355134 2 0 13
402758 1 0 13
152417 2 0 13
329565 4+ 0 13
405177 2 0 13
279302 0 0 4
336891 2 0 4
302525 1 0 4
482756 2 0 4
281904 0 0 4
116604 1 0 4
229649 4+ 0 4
515747 0 0 4
396449 1 0 4
329590 2 0 4
240346 1 0 13

Product_Category_2 Product_Category_3 Purchase


320841 15.0 16.0 197
325430 14.0 16.0 383
162101 15.0 16.0 400
115012 15.0 16.0 560
355134 15.0 16.0 572
402758 15.0 16.0 577
152417 14.0 16.0 582
329565 15.0 16.0 585
405177 15.0 16.0 591
279302 5.0 8.0 693
336891 5.0 8.0 696
302525 5.0 8.0 700
482756 5.0 9.0 706
281904 5.0 8.0 713
116604 5.0 12.0 718
229649 5.0 15.0 731
515747 5.0 12.0 740
396449 5.0 12.0 745
329590 5.0 15.0 747
240346 15.0 16.0 748

[563]: # Sort in descending order


df.sort_values(by='Occupation', ascending=False)

[563]: User_ID Product_ID Gender Age Occupation City_Category \


107410 1004508 P00193642 M 26-35 20 A
470901 1000551 P00195042 M 36-45 20 A
178427 1003600 P00182142 M 36-45 20 B
470979 1000567 P00190842 M 36-45 20 C
49391 1001585 P00000642 M 55+ 20 C
… … … … … … …
282485 1001509 P00165442 M 0-17 0 B
187688 1004972 P00110742 M 36-45 0 C
187689 1004972 P00105342 M 36-45 0 C
490144 1003538 P00015842 M 36-45 0 C
370225 1003049 P00078742 F 51-55 0 C

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


107410 2 0 2
470901 4+ 1 6
178427 4+ 1 1
470979 3 0 2
49391 1 1 1
… … … …
282485 3 0 1
187688 2 0 1
187689 2 0 1
490144 2 1 1
370225 0 1 5

Product_Category_2 Product_Category_3 Purchase


107410 3.0 4.0 13058
470901 8.0 14.0 15929
178427 5.0 6.0 15549
470979 5.0 12.0 13105
49391 6.0 16.0 11609
… … … …
282485 15.0 16.0 4371
187688 2.0 8.0 19497
187689 2.0 15.0 15737
490144 14.0 16.0 15596
370225 8.0 14.0 7119

[166821 rows x 12 columns]

15 loc[]
The loc property is used to get, or set, the value(s) at the specified labels.
Syntax: df.loc[row_label, "column_name"]

[166]: #set the data


df.loc[1,"Gender"] ="M"
df.head(5)
print()

#get particular columns


df.loc[:,['User_ID','Occupation']].head(3)

# Lets try to print all columns for 2nd , 3rd and 4th rows
df.loc[2:4]

#get the rows with particular column


print(df.loc[[3,5],["User_ID","Product_ID"]])

User_ID Product_ID
3 1000001 P00085442
5 1000003 P00193542

16 iloc[]
It is used to get or set the data of a particular cell.
syntax : df.iloc[row_index, column_index]

[175]: # get the cell data


df.iloc[0,1]
df

#print first row using iloc


df.iloc[0]

#print last row using iloc
df.iloc[-1]

#set the cell data


df.iloc[0,1]="santosh"
df

[175]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 santosh F 0-17 10 A
1 1000001 P00248942 M 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
3 1000001 P00085442 F 0-17 10 A
4 1000002 P00285442 M 55+ 16 C
… … … … … … …
550063 1006033 P00372445 M 51-55 13 B
550064 1006035 P00375436 F 26-35 1 C
550065 1006036 P00375436 F 26-35 15 B
550066 1006038 P00375436 F 55+ 1 C
550067 1006039 P00371644 F 46-50 0 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12
4 4+ 0 8
… … … …
550063 1 1 20
550064 3 0 20
550065 4+ 1 20
550066 2 0 20
550067 4+ 1 20

Product_Category_2 Product_Category_3 Purchase 3


0 NaN NaN 8370 NaN
1 6.0 14.0 15200 M
2 NaN NaN 1422 NaN
3 14.0 NaN 1057 NaN
4 NaN NaN 7969 NaN
… … … … …
550063 NaN NaN 368 NaN
550064 NaN NaN 371 NaN
550065 NaN NaN 137 NaN
550066 NaN NaN 365 NaN
550067 NaN NaN 490 NaN

[550068 rows x 13 columns]

17 Drop
drop() - Drops the specified rows/columns from the DataFrame
syntax: df.drop(column_name, axis=1)
axis=0 refers to rows & axis=1 refers to columns
[454]: df =pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\pandas case study\\blackfriday.csv")

df.head(4)

[454]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
3 1000001 P00085442 F 0-17 10 A

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057

[459]: #delete a single column


df.drop(["User_ID"],axis=1).head(5)

[459]: Product_ID Gender Age Occupation City_Category \


0 P00069042 F 0-17 10 A
1 P00248942 F 0-17 10 A
2 P00087842 F 0-17 10 A
3 P00085442 F 0-17 10 A
4 P00285442 M 55+ 16 C

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12

20
4 4+ 0 8

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969

[458]: #delete multiple columns


df.drop(['User_ID', 'Product_ID'], axis=1).head(5)

[458]: Gender Age Occupation City_Category Stay_In_Current_City_Years \


0 F 0-17 10 A 2
1 F 0-17 10 A 2
2 F 0-17 10 A 2
3 F 0-17 10 A 2
4 M 55+ 16 C 4+

Marital_Status Product_Category_1 Product_Category_2 Product_Category_3 \


0 0 3 NaN NaN
1 0 1 6.0 14.0
2 0 12 NaN NaN
3 0 12 14.0 NaN
4 0 8 NaN NaN

Purchase
0 8370
1 15200
2 1422
3 1057
4 7969

[447]: #delete a single row


df.drop(2,axis=0).head(4)

[447]: A B C
0 1 4 1
1 2 5 3

[414]: #delete multiple rows


df.drop([3,4,5],axis=0).head(5)

[414]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
6 1000004 P00184942 M 46-50 7 B
7 1000004 P00346142 M 46-50 7 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
6 2 1 1
7 2 1 1

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
6 8.0 17.0 19215
7 15.0 NaN 15854

18 Check duplicate
The duplicated() method returns a Boolean value for each row

[719]: data = {
"name": ["Sally", "Mary", "John", "Mary"],
"age": [50, 40, 30, 40],
"qualified": [True, False, False, False]
}

df = pd.DataFrame(data)

#it marks the duplicate rows as True


newdf = df.duplicated()
newdf

[719]: 0 False
1 False
2 False
3 True
dtype: bool

[725]: #it will show the row which is duplicate


df[df.duplicated()]

[725]: name age qualified


3 Mary 40 False

[724]: #if you want to check the duplicate in particular column then:
df[df.duplicated("qualified")]

[724]: name age qualified


2 John 30 False
3 Mary 40 False

19 drop Duplicates
[181]: data = {
"name": ["Sally", "Mary", "John", "Mary"],
"age": [50, 40, 30, 40],
"qualified": [True, False, False, False]
}

df = pd.DataFrame(data)

newdf = df.drop_duplicates()
newdf

[181]: name age qualified


0 Sally 50 True
1 Mary 40 False
2 John 30 False

20 dropna
dropna() - The dropna() method removes the rows that contain NULL values.
The dropna() method returns a new DataFrame object unless the inplace parameter is set to True.

[187]: df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\pandas case study\\blackfriday.csv")

df

[187]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
3 1000001 P00085442 F 0-17 10 A
4 1000002 P00285442 M 55+ 16 C
… … … … … … …
550063 1006033 P00372445 M 51-55 13 B
550064 1006035 P00375436 F 26-35 1 C
550065 1006036 P00375436 F 26-35 15 B
550066 1006038 P00375436 F 55+ 1 C
550067 1006039 P00371644 F 46-50 0 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12
4 4+ 0 8
… … … …
550063 1 1 20
550064 3 0 20
550065 4+ 1 20
550066 2 0 20
550067 4+ 1 20

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969
… … … …
550063 NaN NaN 368
550064 NaN NaN 371
550065 NaN NaN 137
550066 NaN NaN 365
550067 NaN NaN 490

[550068 rows x 12 columns]

[190]: #drop missing values from the dataset


d=df.dropna()
d

[190]: User_ID Product_ID Gender Age Occupation City_Category \


1 1000001 P00248942 F 0-17 10 A
6 1000004 P00184942 M 46-50 7 B
13 1000005 P00145042 M 26-35 20 A
14 1000006 P00231342 F 51-55 9 A
16 1000006 P0096642 F 51-55 9 A
… … … … … … …
545902 1006039 P00064042 F 46-50 0 B
545904 1006040 P00081142 M 26-35 6 B
545907 1006040 P00277642 M 26-35 6 B
545908 1006040 P00127642 M 26-35 6 B
545914 1006040 P00217442 M 26-35 6 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \
1 2 0 1
6 2 1 1
13 1 1 1
14 1 0 5
16 1 0 2
… … … …
545902 4+ 1 3
545904 2 0 6
545907 2 0 2
545908 2 0 1
545914 2 0 1

Product_Category_2 Product_Category_3 Purchase


1 6.0 14.0 15200
6 8.0 17.0 19215
13 2.0 5.0 15665
14 8.0 14.0 5378
16 3.0 4.0 13055
… … … …
545902 4.0 12.0 8047
545904 8.0 14.0 16493
545907 3.0 10.0 3425
545908 2.0 15.0 15694
545914 2.0 11.0 11640

[166821 rows x 12 columns]

[189]: #drop missing values on particular column


d=df.dropna(subset=["Product_Category_2","Product_Category_3"]).head(4)
d

[189]: User_ID Product_ID Gender Age Occupation City_Category \


1 1000001 P00248942 F 0-17 10 A
6 1000004 P00184942 M 46-50 7 B
13 1000005 P00145042 M 26-35 20 A
14 1000006 P00231342 F 51-55 9 A

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


1 2 0 1
6 2 1 1
13 1 1 1
14 1 0 5

Product_Category_2 Product_Category_3 Purchase


1 6.0 14.0 15200
6 8.0 17.0 19215
13 2.0 5.0 15665
14 8.0 14.0 5378

[194]: #dropna(inplace=True) removes all rows containing NULL values from
# the original DataFrame itself; it returns None rather than a new dataframe

df.dropna(inplace=True)
df

[194]: User_ID Product_ID Gender Age Occupation City_Category \


1 1000001 P00248942 F 0-17 10 A
6 1000004 P00184942 M 46-50 7 B
13 1000005 P00145042 M 26-35 20 A
14 1000006 P00231342 F 51-55 9 A
16 1000006 P0096642 F 51-55 9 A
… … … … … … …
545902 1006039 P00064042 F 46-50 0 B
545904 1006040 P00081142 M 26-35 6 B
545907 1006040 P00277642 M 26-35 6 B
545908 1006040 P00127642 M 26-35 6 B
545914 1006040 P00217442 M 26-35 6 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


1 2 0 1
6 2 1 1
13 1 1 1
14 1 0 5
16 1 0 2
… … … …
545902 4+ 1 3
545904 2 0 6
545907 2 0 2
545908 2 0 1
545914 2 0 1

Product_Category_2 Product_Category_3 Purchase


1 6.0 14.0 15200
6 8.0 17.0 19215
13 2.0 5.0 15665
14 8.0 14.0 5378
16 3.0 4.0 13055
… … … …
545902 4.0 12.0 8047
545904 8.0 14.0 16493
545907 3.0 10.0 3425
545908 2.0 15.0 15694
545914 2.0 11.0 11640

[166821 rows x 12 columns]

21 Fillna
The fillna() method replaces the NULL values with a specified value.

[215]: df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\test.csv")


df

[215]: name id
0 sd NaN
1 NaN 987.0
2 fg 23.0
3 NaN NaN
4 gh 52.0

[313]: # fill all the missing values with a single value

df.fillna(222222)

[313]: name id
0 sd 222222.0
1 222222 987.0
2 fg 23.0
3 222222 222222.0
4 gh 52.0

[316]: #If you want to fill the particular column null values
#with particular data then you have to use dictionary

df.fillna({'name':'santosh','id': 34})

[316]: name id
0 sd NaN
1 santosh 987.0
2 fg 23.0
3 santosh NaN
4 gh 52.0

[314]: # bfill() replaces the NULL values with the value from the next row
df.fillna(method ="bfill")

[314]: name id
0 sd 987.0
1 fg 987.0
2 fg 23.0
3 gh 52.0
4 gh 52.0

[315]: # ffill() method replaces the NULL values with the value from the previous row
df.fillna(method ="ffill")

[315]: name id
0 sd NaN
1 sd 987.0
2 fg 23.0
3 fg 23.0
4 gh 52.0
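Note: newer pandas versions deprecate the method= argument of fillna(); the dedicated methods below do the same thing (a sketch, assuming a reasonably recent pandas):

[ ]: df.bfill() # same as fillna(method="bfill")
df.ffill() # same as fillna(method="ffill")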

22 Apply method
The apply() method allows you to apply a function along one of the axes of the DataFrame, default
0, which is the index (row) axis.
Syntax : dataframe.apply(func, axis, raw, result_type, args, kwds)

[287]: data = {
"ID": [50, 40, 30],
"EMPID": [300, 1112, 42]
}
df=pd.DataFrame(data)
print("orginal dataframe \n",df)
print()

def fun(y):
if y>1000:
return "NEW EMP"
else:
return "OLD EMP"

df["type"] =df["EMPID"].apply(lambda x: fun(x))


df

original dataframe
ID EMPID
0 50 300
1 40 1112
2 30 42

[287]: ID EMPID type
0 50 300 OLD EMP
1 40 1112 NEW EMP
2 30 42 OLD EMP

23 Replace
replace() - this method replaces the specified value with another specified value.

[727]: import numpy as np

df =pd.DataFrame({'name':['sd',np.nan,'d',np.nan,'df'],'id':[1,np.nan,234,4,np.nan]})

print(df)

#Syntax : replace(old_value, new_value)


d=df.replace("sd","ram")
d

name id
0 sd 1.0
1 NaN NaN
2 d 234.0
3 NaN 4.0
4 df NaN

[727]: name id
0 ram 1.0
1 NaN NaN
2 d 234.0
3 NaN 4.0
4 df NaN

[297]: #replace several values with one new value


df.replace(['sd','d'],99)

[297]: name id
0 99 1.0
1 NaN NaN
2 99 234.0
3 NaN 4.0
4 df NaN

[302]: #replace through regex


df.replace("[a-zA-Z]","santosh",regex=True)

[302]: name id
0 santoshsantosh 1.0
1 NaN NaN
2 santosh 234.0
3 NaN 4.0
4 santoshsantosh NaN

[303]: #replace a particular column's data through regex


df.replace({"name":"[a-z]"}, 22222, regex=True)

[303]: name id
0 22222.0 1.0
1 NaN NaN
2 22222.0 234.0
3 NaN 4.0
4 22222.0 NaN

24 Imputing the missing values


[742]: df

#mode is used in case of categorical features

[742]: name id
0 sd 1.000000
1 NaN 79.666667
2 d 234.000000
3 NaN 4.000000
4 df 79.666667

[741]: # Replacing with mean/median/mode


df["id"] =df["id"].fillna(df["id"].mean())
df

[741]: name id
0 sd 1.000000
1 NaN 79.666667
2 d 234.000000
3 NaN 4.000000
4 df 79.666667
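For the categorical name column, a sketch of mode imputation (mode() returns a Series, so take its first entry):

[ ]: df["name"] = df["name"].fillna(df["name"].mode()[0])
df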

25 Interpolate
The interpolate() method replaces the NULL values based on a specified method.
It fills the numeric columns but does not fill any string values.
[308]: df=pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\test.csv")
df

[308]: name id
0 sd NaN
1 NaN 987.0
2 fg 23.0
3 NaN NaN
4 gh 52.0

[318]: df.interpolate()

[318]: name id
0 sd 987.0
1 fg 987.0
2 fg 23.0
3 gh 52.0
4 gh 52.0

[658]: df.interpolate(method='ffill')

[658]: A B
0 1.0 NaN
1 1.0 1.0
2 2.0 23.0
3 2.0 3.0
4 3.0 3.0

26 Merge
The merge() method combines two DataFrames by merging them together, using the
specified method(s).
It is necessary to have a common column to merge the data on.
[345]: df =pd.DataFrame({ "car":['MB','BMW','TATA','MS'],
"Model-1":[2012,2023,2012,2014]})

df1 =pd.DataFrame({ "car":['MB','BMW','TATA','BMW'],


"Model-2":[2013,2013,2022,2021]})

var =pd.merge(df,df1 ,on="car")


var

[345]: car Model-1 Model-2


0 MB 2012 2013
1 BMW 2023 2013
2 BMW 2023 2021
3 TATA 2012 2022

The how attribute helps us specify how to merge the data,
like how="left", which works the same as a left join.
The same applies to "right", "inner", "cross" and "outer".
[339]: var =pd.merge(df,df1 ,how="right")
var

[339]: car Model-1 Model-2


0 MB 2012 2013
1 BMW 2023 2013
2 TATA 2012 2022
3 BMW 2023 2021

27 concat
concat() - The concat function is used to concatenate pandas objects.
While concatenating horizontally (axis=1), both DataFrames should have an equal number of rows;
while concatenating vertically (axis=0), both should have an equal number of columns.
concat simply stacks the data; it does not merge on the basis of a join key.
[342]: df =pd.DataFrame({ "car":['MB','BMW','TATA','MS'],
"Model-1":[2012,2023,2012,2014]})

df1 =pd.DataFrame({ "car":['MB','BMW','TATA','BMW'],


"Model-2":[2013,2013,2022,2021]})

#by default it concatenates along axis=0, stacking the rows vertically


var =pd.concat([df,df1])
var

[342]: car Model-1 Model-2


0 MB 2012.0 NaN
1 BMW 2023.0 NaN
2 TATA 2012.0 NaN
3 MS 2014.0 NaN
0 MB NaN 2013.0
1 BMW NaN 2013.0
2 TATA NaN 2022.0
3 BMW NaN 2021.0

[343]: #concatenate along axis=1, placing the frames side by side


var =pd.concat([df,df1],axis =1)
var

[343]: car Model-1 car Model-2
0 MB 2012 MB 2013
1 BMW 2023 BMW 2013
2 TATA 2012 TATA 2022
3 MS 2014 BMW 2021

28 Join
The join() method inserts column(s) from another DataFrame, or Series.

[384]: import pandas as pd

var1 =[1,23,4,56,34]
var2 =[4,56,78,9,89]
var3 =[1,23,]
var4 =[11,23]

df = pd.DataFrame({"A":var1,"B":var2})
df2 = pd.DataFrame({"C":var3, "D": var4})

#join on the basis of an outer join; you can also use inner, left, right, etc.
data =df.join(df2, how ="outer")

data

[384]: A B C D
0 1 4 1.0 11.0
1 23 56 23.0 23.0
2 4 78 NaN NaN
3 56 9 NaN NaN
4 34 89 NaN NaN

29 append()
method appends a DataFrame-like object at the end of the current DataFrame.
[385]: var1 =[1,23,4,56,34]
var2 =[4,56,78,9,89]
var3 =[1,23,]
var4 =[11,23]

df1 = pd.DataFrame({"A":var1,"B":var2})
df2 = pd.DataFrame({"C":var3, "D": var4})

newdf = df1.append(df2)
newdf

C:\Users\sanram\AppData\Local\Temp\ipykernel_16444\1789055825.py:10:
FutureWarning: The frame.append method is deprecated and will be removed from
pandas in a future version. Use pandas.concat instead.
newdf = df1.append(df2)

[385]: A B C D
0 1.0 4.0 NaN NaN
1 23.0 56.0 NaN NaN
2 4.0 78.0 NaN NaN
3 56.0 9.0 NaN NaN
4 34.0 89.0 NaN NaN
0 NaN NaN 1.0 11.0
1 NaN NaN 23.0 23.0
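As the FutureWarning suggests, pd.concat is the replacement and produces the same result:

[ ]: newdf = pd.concat([df1, df2])
newdf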

30 GroupBy
groupby() is used to split the data into groups based on some criteria.
The groupby method groups the data according to categories and applies an aggregate
function to each category, like max, min, sum, mean, etc.
[572]: var1= ['sd','fd','kj','sr','ram','sd']
var2=[12,34,45,1,99,67]
var3 = ['maths','physics','chem','economics','bio','physics']

df =pd.DataFrame({'name' : var1, 'marks': var2 , 'sub':var3})


df

[572]: name marks sub


0 sd 12 maths
1 fd 34 physics
2 kj 45 chem
3 sr 1 economics
4 ram 99 bio
5 sd 67 physics

[581]: va1r =df.groupby("name")


va1r
##the grouped data is stored at the memory address shown below; let's fetch it

#lets access them

[581]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000227B8081EE0>

[580]: #group by on a particular column
va1r =df.groupby("name").groups
va1r

[580]: {'fd': [1], 'kj': [2], 'ram': [4], 'sd': [0, 5], 'sr': [3]}

[579]: #lets see who scored the highest marks


df.groupby("name").agg('max')

#likewise you can use min, sum, mean, etc.

[579]: marks sub


name
fd 34 physics
kj 45 chem
ram 99 bio
sd 67 physics
sr 1 economics

31 The melt()
method reshapes the DataFrame into a long table, with one row for each value of each melted column.
[386]: day_var =[1,2,3,4,5]
eng_marks =[30,65,45,678,91]
maths_marks =[55,34,56,70,66]
df =pd.DataFrame({"day":day_var,"english":eng_marks,"maths":maths_marks})
df

[386]: day english maths


0 1 30 55
1 2 65 34
2 3 45 56
3 4 678 70
4 5 91 66

[388]: #melt the data vertically


pd.melt(df)

[388]: variable value


0 day 1
1 day 2
2 day 3
3 day 4
4 day 5
5 english 30
6 english 65
7 english 45
8 english 678
9 english 91
10 maths 55
11 maths 34
12 maths 56
13 maths 70
14 maths 66
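Often you want to keep an identifier column fixed while melting the rest; a sketch using id_vars (var_name and value_name just rename the output columns):

[ ]: pd.melt(df, id_vars="day", var_name="subject", value_name="marks")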

32 Pivot
It helps us reshape the DataFrame.
[390]: day_var =[1,2,3,4,5]
eng_marks =[30,65,45,678,91]
maths_marks =[55,34,56,70,66]
st_name =['sd','df','ram','sd','df']
df =pd.DataFrame({"day":day_var,"Stu_name":st_name, "english":eng_marks,"maths":
↪maths_marks})

df

[390]: day Stu_name english maths


0 1 sd 30 55
1 2 df 65 34
2 3 ram 45 56
3 4 sd 678 70
4 5 df 91 66

[391]: df.pivot(index="day",columns="Stu_name")

[391]: english maths


Stu_name df ram sd df ram sd
day
1 NaN NaN 30.0 NaN NaN 55.0
2 65.0 NaN NaN 34.0 NaN NaN
3 NaN 45.0 NaN NaN 56.0 NaN
4 NaN NaN 678.0 NaN NaN 70.0
5 91.0 NaN NaN 66.0 NaN NaN

33 Date Range
pandas date_range is useful for creating a range of dates or times.
It is mainly used for reindexing a datetime index.
Syntax : pd.date_range(start_time, end_time)

B - business day frequency
C - custom business day frequency
D - calendar day frequency
W - weekly frequency
M - month end frequency
SM - semi-month end frequency (15th and end of month)
BM - business month end frequency
H - hourly frequency
T / min - minutely frequency
S - secondly frequency
ms - milliseconds
3H - a 3-hour step
[435]: df = pd.date_range(start ='2020-01-01' , end='2020-01-02 ', freq ='4T')
df

[435]: DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:04:00',


'2020-01-01 00:08:00', '2020-01-01 00:12:00',
'2020-01-01 00:16:00', '2020-01-01 00:20:00',
'2020-01-01 00:24:00', '2020-01-01 00:28:00',
'2020-01-01 00:32:00', '2020-01-01 00:36:00',

'2020-01-01 23:24:00', '2020-01-01 23:28:00',
'2020-01-01 23:32:00', '2020-01-01 23:36:00',
'2020-01-01 23:40:00', '2020-01-01 23:44:00',
'2020-01-01 23:48:00', '2020-01-01 23:52:00',
'2020-01-01 23:56:00', '2020-01-02 00:00:00'],
dtype='datetime64[ns]', length=361, freq='4T')

[429]: print(type(df))

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

[439]: #creating a date range with 10 equal periods

date_periods = pd.date_range(start ="01/02/2023 00:00:00", end="01/02/2023 00:24:00", periods =10)

date_periods

#it divides the range into equal time intervals between the two dates

[439]: DatetimeIndex(['2023-01-02 00:00:00', '2023-01-02 00:02:40',
'2023-01-02 00:05:20', '2023-01-02 00:08:00',
'2023-01-02 00:10:40', '2023-01-02 00:13:20',
'2023-01-02 00:16:00', '2023-01-02 00:18:40',
'2023-01-02 00:21:20', '2023-01-02 00:24:00'],
dtype='datetime64[ns]', freq=None)
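A couple more of the frequency aliases from the table above in action (the dates are arbitrary):

[ ]: pd.date_range(start="2020-01-01", end="2020-06-30", freq="M") # month-end dates
pd.date_range(start="2020-01-01", periods=5, freq="B") # 5 business days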

34 Count method()
Count the number of not-NULL values in each column:

[ ]: data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})

#count the not-NULL values in each column

print(data.count())

# count values in name column


print(data['name'].value_counts()['sravan'])

# count values in subjects column


print(data['subjects'].value_counts()['php'])

# count values in marks column


print(data['marks'].value_counts()[89])

35 isna() / isnull()
It helps us detect NA values
[473]: data = pd.DataFrame({
'name': ['sravan', 'ojsawi', 'bobby', 'rohith',
'gnanesh', 'sravan', 'sravan', 'ojaswi'],
'subjects': ['java', 'php', 'java', 'php', 'java',
'html/css', 'python', 'R'],
'marks': [98, 90, 78, 91, 87, 78, 89, 90],
'age': [11, 23, 23, 21, 21, 21, 23, 21]
})

#checking null values

print(data.isna()) #or #print(data.isnull()) both are same

#checking null values in a particular column


print(data["name"].isna())

name subjects marks age


0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
Name: name, dtype: bool

[478]: df = pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\pandas case study\\blackfriday.csv")

df.isnull()

[478]: User_ID Product_ID Gender Age Occupation City_Category \


0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
… … … … … … …
550063 False False False False False False
550064 False False False False False False
550065 False False False False False False
550066 False False False False False False
550067 False False False False False False

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
… … … …
550063 False False False
550064 False False False
550065 False False False
550066 False False False
550067 False False False

Product_Category_2 Product_Category_3 Purchase


0 True True False
1 False False False
2 True True False
3 False True False
4 True True False
… … … …
550063 True True False
550064 True True False
550065 True True False
550066 True True False
550067 True True False

[550068 rows x 12 columns]

[481]: df.isnull().sum()

[481]: User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 173638
Product_Category_3 383247
Purchase 0
dtype: int64

[495]: df.dropna(inplace=True)
df

[495]: User_ID Product_ID Gender Age Occupation City_Category \


1 1000001 P00248942 F 0-17 10 A
6 1000004 P00184942 M 46-50 7 B
13 1000005 P00145042 M 26-35 20 A
14 1000006 P00231342 F 51-55 9 A
16 1000006 P0096642 F 51-55 9 A
… … … … … … …
545902 1006039 P00064042 F 46-50 0 B
545904 1006040 P00081142 M 26-35 6 B
545907 1006040 P00277642 M 26-35 6 B
545908 1006040 P00127642 M 26-35 6 B
545914 1006040 P00217442 M 26-35 6 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


1 2 0 1
6 2 1 1
13 1 1 1
14 1 0 5
16 1 0 2
… … … …
545902 4+ 1 3
545904 2 0 6
545907 2 0 2
545908 2 0 1
545914 2 0 1

Product_Category_2 Product_Category_3 Purchase


1 6.0 14.0 15200
6 8.0 17.0 19215
13 2.0 5.0 15665
14 8.0 14.0 5378
16 3.0 4.0 13055
… … … …
545902 4.0 12.0 8047
545904 8.0 14.0 16493
545907 3.0 10.0 3425
545908 2.0 15.0 15694
545914 2.0 11.0 11640

[166821 rows x 12 columns]

[508]: df.isnull().sum()

[508]: User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 0
Product_Category_3 0
Purchase 0
dtype: int64

36 Filter
It helps us access a group of rows or columns whose labels match a pattern.
[510]: #filter in columns
df.filter(like ="Product", axis=1)

[510]: Product_ID Product_Category_1 Product_Category_2 Product_Category_3


1 P00248942 1 6.0 14.0
6 P00184942 1 8.0 17.0
13 P00145042 1 2.0 5.0
14 P00231342 5 8.0 14.0
16 P0096642 2 3.0 4.0
… … … … …
545902 P00064042 3 4.0 12.0
545904 P00081142 6 8.0 14.0
545907 P00277642 2 3.0 10.0
545908 P00127642 1 2.0 15.0
545914 P00217442 1 2.0 11.0

[166821 rows x 4 columns]

[512]: #filter in rows


df.filter(like ="99", axis=0)

[512]: User_ID Product_ID Gender Age Occupation City_Category \


544999 1005888 P00126242 M 26-35 20 B
545099 1005915 P00015642 M 18-25 4 C
545299 1005950 P00177442 M 26-35 4 B
545399 1005960 P00085142 F 46-50 0 C
545499 1005978 P00183842 M 36-45 1 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


544999 1 0 2
545099 0 0 8
545299 2 1 1
545399 1 1 5
545499 2 0 4

Product_Category_2 Product_Category_3 Purchase


544999 4.0 5.0 6895
545099 16.0 17.0 8093
545299 6.0 8.0 15256
545399 13.0 14.0 6950
545499 9.0 12.0 1392

37 Copy()
The copy() method returns a copy of the DataFrame. By default, the copy is a "deep copy",
meaning that any changes made in the original DataFrame will NOT be reflected in the copy.
[542]: df1 = df.copy()
df1

[542]: User_ID Product_ID Gender Age Occupation City_Category \


1 1000001 P00248942 F 0-17 10 A
6 1000004 P00184942 M 46-50 7 B
13 1000005 P00145042 M 26-35 20 A
14 1000006 P00231342 F 51-55 9 A
16 1000006 P0096642 F 51-55 9 A

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


1 2 0 1
6 2 1 1
13 1 1 1
14 1 0 5
16 1 0 2

Product_Category_2 Product_Category_3 Purchase


1 6.0 14.0 15200
6 8.0 17.0 19215
13 2.0 5.0 15665
14 8.0 14.0 5378
16 3.0 4.0 13055

[ ]: #shallow copy
In a shallow copy, if we change something in the new DataFrame,
the changes get reflected in the original DataFrame.

[640]: df3 =df.copy(deep=False)


df3.head()

[640]: User_ID Product_ID Gender Age Occupation City_Category \


6 1000004 P00184942 M 46-50 7.0 B
13 1000005 P00145042 M 26-35 20.0 A
18 1000007 P00036842 M 36-45 1.0 B
19 1000008 P00249542 M 26-35 12.0 C
24 1000008 P00303442 M 26-35 12.0 C

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


6 2 1.0 1
13 1 1.0 1
18 1 1.0 1
19 4+ 1.0 1
24 4+ 1.0 1

Product_Category_2 Product_Category_3 Purchase


6 8.0 17.0 19215
13 2.0 5.0 15665
18 14.0 16.0 11788
19 5.0 15.0 19614
24 8.0 14.0 11927
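A small sketch of the shallow-copy behaviour described above, assuming the classic (pre-Copy-on-Write) pandas semantics; with Copy-on-Write enabled in newer versions this propagation no longer happens:

[ ]: orig = pd.DataFrame({"A":[1,2,3]})
shallow = orig.copy(deep=False)
shallow.loc[0,"A"] = 99 # mutate through the shallow copy
print(orig.loc[0,"A"]) # 99 -- the original sees the change (pre-CoW behaviour)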

38 Data Cleaning
[582]: df =pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\pandas case study\\blackfriday.csv")

df

[582]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
3 1000001 P00085442 F 0-17 10 A
4 1000002 P00285442 M 55+ 16 C
… … … … … … …
550063 1006033 P00372445 M 51-55 13 B
550064 1006035 P00375436 F 26-35 1 C
550065 1006036 P00375436 F 26-35 15 B
550066 1006038 P00375436 F 55+ 1 C
550067 1006039 P00371644 F 46-50 0 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12
4 4+ 0 8
… … … …
550063 1 1 20
550064 3 0 20
550065 4+ 1 20
550066 2 0 20
550067 4+ 1 20

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969
… … … …
550063 NaN NaN 368
550064 NaN NaN 371
550065 NaN NaN 137
550066 NaN NaN 365
550067 NaN NaN 490

[550068 rows x 12 columns]

[584]: #basic statistics


df.describe()

[584]: User_ID Occupation Marital_Status Product_Category_1 \


count 5.500680e+05 550068.000000 550068.000000 550068.000000
mean 1.003029e+06 8.076707 0.409653 5.404270
std 1.727592e+03 6.522660 0.491770 3.936211
min 1.000001e+06 0.000000 0.000000 1.000000
25% 1.001516e+06 2.000000 0.000000 1.000000
50% 1.003077e+06 7.000000 0.000000 5.000000
75% 1.004478e+06 14.000000 1.000000 8.000000
max 1.006040e+06 20.000000 1.000000 20.000000

Product_Category_2 Product_Category_3 Purchase


count 376430.000000 166821.000000 550068.000000
mean 9.842329 12.668243 9263.968713
std 5.086590 4.125338 5023.065394
min 2.000000 3.000000 12.000000
25% 5.000000 9.000000 5823.000000
50% 9.000000 14.000000 8047.000000
75% 15.000000 16.000000 12054.000000
max 18.000000 18.000000 23961.000000

[585]: #a random sample of 10 rows


df.sample(10)

[585]: User_ID Product_ID Gender Age Occupation City_Category \


51382 1001860 P00006142 M 55+ 16 C
546903 1001456 P00371644 M 26-35 2 C
432173 1000543 P00205642 M 26-35 5 B
389912 1006004 P00184242 F 26-35 15 C
172140 1002624 P00182242 F 36-45 0 A
218182 1003661 P00113342 M 36-45 12 C
101932 1003752 P00220342 F 18-25 1 B
438737 1001545 P00178642 M 26-35 20 A
534915 1004351 P00340642 M 26-35 12 C
549347 1004992 P00371644 F 26-35 2 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


51382 1 0 8
546903 1 0 20
432173 4+ 1 5
389912 2 0 9
172140 1 1 1
218182 0 1 1
101932 4+ 0 5
438737 1 0 5
534915 2 0 5
549347 2 0 20

Product_Category_2 Product_Category_3 Purchase


51382 NaN NaN 6127
546903 NaN NaN 136
432173 8.0 NaN 6857
389912 15.0 NaN 13829
172140 5.0 6.0 15905
218182 8.0 17.0 11655
101932 NaN NaN 7065
438737 15.0 NaN 5232
534915 NaN NaN 1959
549347 NaN NaN 363

[586]: df.isnull().sum()

[586]: User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 0
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 0
Product_Category_1 0
Product_Category_2 173638
Product_Category_3 383247
Purchase 0
dtype: int64

[591]: df_list= list(df)


df_list

[591]: ['User_ID',
'Product_ID',
'Gender',
'Age',
'Occupation',
'City_Category',
'Stay_In_Current_City_Years',
'Marital_Status',
'Product_Category_1',
'Product_Category_2',
'Product_Category_3',
'Purchase']

[609]: #count of NaN values


df.isnull().sum()

[609]: User_ID 0
Product_ID 0
Gender 0
Age 0
Occupation 69638
City_Category 0
Stay_In_Current_City_Years 0
Marital_Status 324731
Product_Category_1 0
Product_Category_2 173638
Product_Category_3 383247
Purchase 0
dtype: int64

[612]: #converting 0 values to NaN

df[df_list[0:12]] = df[df_list[0:12]].replace(0,np.nan)
df

[612]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10.0 A
1 1000001 P00248942 F 0-17 10.0 A
2 1000001 P00087842 F 0-17 10.0 A
3 1000001 P00085442 F 0-17 10.0 A
4 1000002 P00285442 M 55+ 16.0 C
… … … … … … …
550063 1006033 P00372445 M 51-55 13.0 B
550064 1006035 P00375436 F 26-35 1.0 C
550065 1006036 P00375436 F 26-35 15.0 B
550066 1006038 P00375436 F 55+ 1.0 C
550067 1006039 P00371644 F 46-50 NaN B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 NaN 3
1 2 NaN 1
2 2 NaN 12
3 2 NaN 12
4 4+ NaN 8
… … … …
550063 1 1.0 20
550064 3 NaN 20
550065 4+ 1.0 20
550066 2 NaN 20
550067 4+ 1.0 20

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969
… … … …
550063 NaN NaN 368
550064 NaN NaN 371
550065 NaN NaN 137
550066 NaN NaN 365
550067 NaN NaN 490

[550068 rows x 12 columns]

[620]: print("before inplace", df.shape)


df.dropna(inplace=True)
print("after inplace", df.shape)

before inplace (550068, 12)


after inplace (58507, 12)

[621]: df

[621]: User_ID Product_ID Gender Age Occupation City_Category \


6 1000004 P00184942 M 46-50 7.0 B
13 1000005 P00145042 M 26-35 20.0 A
18 1000007 P00036842 M 36-45 1.0 B
19 1000008 P00249542 M 26-35 12.0 C
24 1000008 P00303442 M 26-35 12.0 C
… … … … … … …
545885 1006036 P00207342 F 26-35 15.0 B
545887 1006036 P00127742 F 26-35 15.0 B
545888 1006036 P00196042 F 26-35 15.0 B
545889 1006036 P00129342 F 26-35 15.0 B
545890 1006036 P00244142 F 26-35 15.0 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


6 2 1.0 1
13 1 1.0 1
18 1 1.0 1
19 4+ 1.0 1
24 4+ 1.0 1
… … … …
545885 4+ 1.0 5
545887 4+ 1.0 1
545888 4+ 1.0 4
545889 4+ 1.0 1
545890 4+ 1.0 1

Product_Category_2 Product_Category_3 Purchase


6 8.0 17.0 19215
13 2.0 5.0 15665
18 14.0 16.0 11788
19 5.0 15.0 19614
24 8.0 14.0 11927
… … … …
545885 8.0 14.0 3706
545887 2.0 15.0 11398
545888 9.0 15.0 2852
545889 5.0 15.0 7830
545890 2.0 15.0 7846

[58507 rows x 12 columns]

[622]: df.isnull().mean()

[622]: User_ID 0.0


Product_ID 0.0
Gender 0.0
Age 0.0
Occupation 0.0
City_Category 0.0
Stay_In_Current_City_Years 0.0
Marital_Status 0.0
Product_Category_1 0.0
Product_Category_2 0.0
Product_Category_3 0.0
Purchase 0.0
dtype: float64

[676]: df =pd.read_csv("C:\\Users\\sanram\\Videos\\Jupyter python practise\\pandas case study\\blackfriday.csv")

df

[676]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 10 A
1 1000001 P00248942 F 0-17 10 A
2 1000001 P00087842 F 0-17 10 A
3 1000001 P00085442 F 0-17 10 A
4 1000002 P00285442 M 55+ 16 C
… … … … … … …
550063 1006033 P00372445 M 51-55 13 B
550064 1006035 P00375436 F 26-35 1 C
550065 1006036 P00375436 F 26-35 15 B
550066 1006038 P00375436 F 55+ 1 C
550067 1006039 P00371644 F 46-50 0 B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 0 3
1 2 0 1
2 2 0 12
3 2 0 12
4 4+ 0 8
… … … …
550063 1 1 20
550064 3 0 20
550065 4+ 1 20
550066 2 0 20
550067 4+ 1 20

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969
… … … …
550063 NaN NaN 368
550064 NaN NaN 371
550065 NaN NaN 137
550066 NaN NaN 365
550067 NaN NaN 490

[550068 rows x 12 columns]

[706]: var =df.groupby("Marital_Status").groups


print(list(var))

[0, 1]

[709]: def a(x):


if x==0:
return "good husband"
else:
return "bad husband"

df["Marital_Status"] = df["Marital_Status"].apply(a)

[710]: df

[710]: User_ID Product_ID Gender Age Occupation City_Category \


0 1000001 P00069042 F 0-17 bad A
1 1000001 P00248942 F 0-17 bad A
2 1000001 P00087842 F 0-17 bad A
3 1000001 P00085442 F 0-17 bad A
4 1000002 P00285442 M 55+ bad C
… … … … … … …
550063 1006033 P00372445 M 51-55 bad B
550064 1006035 P00375436 F 26-35 bad C
550065 1006036 P00375436 F 26-35 bad B
550066 1006038 P00375436 F 55+ bad C
550067 1006039 P00371644 F 46-50 bad B

Stay_In_Current_City_Years Marital_Status Product_Category_1 \


0 2 good husband 3
1 2 good husband 1
2 2 good husband 12
3 2 good husband 12
4 4+ good husband 8
… … … …
550063 1 bad husband 20
550064 3 good husband 20
550065 4+ bad husband 20
550066 2 good husband 20
550067 4+ bad husband 20

Product_Category_2 Product_Category_3 Purchase


0 NaN NaN 8370
1 6.0 14.0 15200
2 NaN NaN 1422
3 14.0 NaN 1057
4 NaN NaN 7969
… … … …
550063 NaN NaN 368
550064 NaN NaN 371
550065 NaN NaN 137
550066 NaN NaN 365
550067 NaN NaN 490

[550068 rows x 12 columns]

[713]: import seaborn as sea


sea.countplot(df["Marital_Status"])

C:\Users\sanram\Anaconda3\lib\site-packages\seaborn\_decorators.py:36:
FutureWarning: Pass the following variable as a keyword arg: x. From version
0.12, the only valid positional argument will be `data`, and passing other
arguments without an explicit keyword will result in an error or
misinterpretation.
warnings.warn(

[713]: <AxesSubplot:xlabel='Marital_Status', ylabel='count'>

39 Pandas - Data Correlations
The corr() method calculates the relationship between each column in your data set.

[5]: import pandas as pd


df=pd.read_csv("C:\\Users\\sanram\Videos\\Jupyter python practise\\data.csv")
df

[5]: Duration Pulse Maxpulse Calories


0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. … … … …
164 60 105 140 290.8
165 60 110 145 300.0
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

[169 rows x 4 columns]

[6]: df.corr()

[6]: Duration Pulse Maxpulse Calories


Duration 1.000000 -0.155408 0.009403 0.922717
Pulse -0.155408 1.000000 0.786535 0.025121
Maxpulse 0.009403 0.786535 1.000000 0.203813
Calories 0.922717 0.025121 0.203813 1.000000

40 astype() Method
The astype() method returns a new DataFrame where
the data types have been changed to the specified type.
[8]: #converting the int datatype to float
df.astype(dtype='float64')

[8]: Duration Pulse Maxpulse Calories


0 60.0 110.0 130.0 409.1
1 60.0 117.0 145.0 479.0
2 60.0 103.0 135.0 340.0
3 45.0 109.0 175.0 282.4
4 45.0 117.0 148.0 406.0
.. … … … …
164 60.0 105.0 140.0 290.8
165 60.0 110.0 145.0 300.0
166 60.0 115.0 145.0 310.2
167 75.0 120.0 150.0 320.4
168 75.0 125.0 150.0 330.4

[169 rows x 4 columns]
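astype() also accepts a dict to convert only selected columns; a sketch using the columns of the frame above:

[ ]: df.astype({"Duration":"float64", "Pulse":"int32"})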
