Pandas - Jupyter Notebook

Uploaded by disguisedacc511

The document provides an overview of working with Pandas Series and DataFrames. It discusses how to create Series and DataFrames from various data types like lists, NumPy arrays, and dictionaries. It also covers common operations on Series like aggregate functions, absolute value, appending, and string methods. For DataFrames, it discusses slicing, adding/deleting columns and rows, transposing, statistical functions, renaming, sorting, and working with CSV files. The document aims to serve as a comprehensive reference for working with Pandas Series and DataFrames.

2/13/24, 2:41 PM Pandas - Jupyter Notebook

GeeksforGeeks

Pandas
In [6]:  import numpy as np
import pandas as pd

Table of Contents

1. Working with Pandas Series


----- a) Creating Series
Series through list
Series through Numpy Array
Setting up our own index
Series through dictionary
Using repeat function along with creating a Series
Accessing data from Series
----- b) Aggregate function on Pandas Series
----- c) Series Absolute Function
----- d) Appending Series
----- e) Astype Function
----- f) Between Functions
----- g) All string functions can be used to extract or modify texts in a series
Upper and Lower Function
Len function
Strip Function
Split Function
Contains Function
Replace Function
Count Function
Startswith and Endswith Function
Find Function
----- h) Converting a Series to List

2. Detailed Coding Implementations on Pandas DataFrame


-----a) Creating Data Frames
-----b) Slicing in DataFrames Using Iloc and Loc
Basic Loc Operations
Basic Iloc Operations
Slicing Using Conditions
-----c) Column Addition in DataFrames
Using List
Using Pandas Series

localhost:8888/notebooks/Desktop/Python/Pandas/Pandas.ipynb 1/44

Using an Existing Column


-----d) Deleting Column in DataFrame
Using del
Using pop function
----- e) Addition of rows
----- f) Drop function
----- g) Transposing a DataFrame
----- h) A set of more DataFrame Functionalities
axes function
ndim function
dtypes function
shape function
head function
tail function
empty function
----- i) Statistical or Mathematical Functions
Sum
Mean
Median
Mode
Variance
Min
Max
Standard Deviation
----- j) Describe Function
----- k) Pipe Functions:
Pipe function
Apply Function
Applymap Function
----- l) Reindex Function
----- m) Renaming Columns in Pandas DataFrame
----- n) Sorting in Pandas DataFrame
----- o) Groupby Functions
Adding Statistical Computation on groupby
Using Filter Function with Groupby

3. Working with csv files and basic data Analysis Using Pandas

-----a) Reading CSV


-----b) Info Function
-----c) isnull() Function
-----d) Quantile Function
-----e) Copy Function
-----f) Value Counts Function
-----g) Unique and Nunique functions
-----h) dropna() function
-----i) fillna() function
-----j) sample Functions
-----k) to_csv() functions

4. A detailed Pandas Profile Report


1. Working with Pandas Series

a) Creating Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, python objects, etc.). The axis labels are collectively called index.
Labels need not be unique but must be a hashable type. The object supports both integer
and label-based indexing and provides a host of methods for performing operations
involving the index.

Series through list

In [ ]:  lst = [1,2,3,4,5]

pd.Series(lst)

Series through Numpy array

In [ ]:  arr = np.array([1,2,3,4,5])
pd.Series(arr)

Setting up our own index

In [12]:  pd.Series(index = ['Eshant', 'Pranjal', 'Jayesh', 'Ashish'], data = [1,2,3,4])

Out[12]: Eshant 1
Pranjal 2
Jayesh 3
Ashish 4
dtype: int64

Series through Dictionary values.

In [15]:  steps = {'day1' : 4000, 'day2' : 3000, 'day3' : 12000}



pd.Series(steps)

Out[15]: day1 4000


day2 3000
day3 12000
dtype: int64

Using repeat function along with creating a Series

Pandas Series.repeat() function repeat elements of a Series. It returns a new Series where
each element of the current Series is repeated consecutively a given number of times.


In [19]:  pd.Series(5).repeat(3)

Out[19]: 0 5
0 5
0 5
dtype: int64

We can use the reset_index function to renumber the index:

In [27]:  pd.Series(5).repeat(3).reset_index(drop = True)

Out[27]: 0 5
1 5
2 5
dtype: int64

This code indicates:

10 should be repeated 5 times, and
20 should be repeated 2 times.

In [29]:  s = pd.Series([10,20]).repeat([5,2]).reset_index(drop = True)



s

Out[29]: 0 10
1 10
2 10
3 10
4 10
5 20
6 20
dtype: int64

Accessing elements

In [34]:  s[4]

Out[34]: 10

An access like s[7] or s[50] would not work, because we can only access elements through the index labels the Series actually has (here 0 to 6).

In [38]:  s[6]

Out[38]: 20

Slicing (start to end-1); negative positions count from the end


In [49]:  s[2:-2]

Out[49]: 2 10
3 10
4 10
dtype: int64

b) Aggregate function on pandas Series

Pandas Series.aggregate() function aggregates using one or more operations over the
specified axis of the given Series object.

In [58]:  sr = pd.Series([1,2,3,4,5,6,7])

sr.agg([min,max,sum])

Out[58]: min 1
max 7
sum 28
dtype: int64

c) Series absolute function

Pandas Series.abs() method is used to get the absolute numeric value of each element in
Series/DataFrame.

In [60]:  sr = pd.Series([1,-2,3,-4,5,-6,7])

sr.abs()

Out[60]: 0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64

d) Appending Series

Pandas Series.append() function is used to concatenate two or more Series objects. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() is the replacement.

Syntax: Series.append(to_append, ignore_index=False, verify_integrity=False)

Parameters:
to_append : Series or list/tuple of Series
ignore_index : If True, do not use the index labels.
verify_integrity : If True, raise an Exception on creating an index with duplicates.


In [67]:  sr1 = pd.Series([1,-2,3])


sr2 = pd.Series([1,2,3])
sr3 = sr2.append(sr1)

sr3

Out[67]: 0 1
1 2
2 3
0 1
1 -2
2 3
dtype: int64

To make the index accurate:

In [71]:  sr3.reset_index(drop = True)

Out[71]: 0 1
1 2
2 3
3 1
4 -2
5 3
dtype: int64
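Since Series.append() was removed in pandas 2.0, the same concatenation can be written with pd.concat(). A minimal sketch reproducing the result above:

```python
import pandas as pd

sr1 = pd.Series([1, -2, 3])
sr2 = pd.Series([1, 2, 3])

# pd.concat replaces the removed Series.append();
# ignore_index=True renumbers the result from 0
sr3 = pd.concat([sr2, sr1], ignore_index=True)
print(sr3)
```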

e) Astype function

Pandas astype() is one of the most important methods. It is used to change the data type of
a Series. When a DataFrame is made from a CSV file, the columns are imported and their data
types are set automatically, which many times is not what they actually should be.

In [75]:  sr1

Out[75]: 0 1
1 -2
2 3
dtype: int64

You can see below that the element type is numpy.int64

In [76]:  type(sr1[0])

Out[76]: numpy.int64

Now, after converting with astype('float'), you can see the dtype becomes float64


In [80]:  sr1.astype('float')

Out[80]: 0 1.0
1 -2.0
2 3.0
dtype: float64

f) Between Function

Pandas between() method is used on a Series to check which values lie between the first and
second argument (both bounds inclusive by default).

In [86]:  sr1 = pd.Series([1,2,30,4,5,6,7,8,9,20])


sr1

Out[86]: 0 1
1 2
2 30
3 4
4 5
5 6
6 7
7 8
8 9
9 20
dtype: int64

In [87]:  sr1.between(10,50)

Out[87]: 0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 True
dtype: bool
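Because between() returns a boolean mask, it can be passed straight to the indexing operator to filter the Series; a small sketch:

```python
import pandas as pd

sr1 = pd.Series([1, 2, 30, 4, 5, 6, 7, 8, 9, 20])

# keep only the values lying in the inclusive range [10, 50]
filtered = sr1[sr1.between(10, 50)]
print(filtered)
```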

g) All string functions can be used to extract or modify texts in a series

Upper and Lower Function


Len function
Strip Function
Split Function
Contains Function
Replace Function
Count Function
Startswith and Endswith Function
Find Function


In [88]:  ser = pd.Series(["Eshant Das" , "Data Science" , "Geeks for Geeks" , 'Hello World' , 'Machine Learning'])

Upper and Lower Function

In [92]:  print(ser.str.upper())
print('-'*30)
print(ser.str.lower())

0 ESHANT DAS
1 DATA SCIENCE
2 GEEKS FOR GEEKS
3 HELLO WORLD
4 MACHINE LEARNING
dtype: object
------------------------------
0 eshant das
1 data science
2 geeks for geeks
3 hello world
4 machine learning
dtype: object

Length function

In [94]:  for i in ser:
    print(len(i))

10
12
15
11
16

Strip Function

In [95]:  ser = pd.Series([" Eshant Das" , "Data Science" , "Geeks for Geeks" , '

for i in ser:
    print(i , len(i))

Eshant Das 12
Data Science 12
Geeks for Geeks 15
Hello World 11
Machine Learning 18

After applying strip(), the 2 extra spaces are removed:


In [96]:  ser = ser.str.strip()

for i in ser:
    print(i , len(i))

Eshant Das 10
Data Science 12
Geeks for Geeks 15
Hello World 11
Machine Learning 16

Split Function

In [108]:  ser.str.split()

Out[108]: 0 [Eshant, Das]


1 [Data, Science]
2 [Geeks, for, Geeks]
3 [Hello, World]
4 [Machine, Learning]
dtype: object

If we want the list of split words for only the first string in the Series:

In [109]:  ser.str.split()[0]

Out[109]: ['Eshant', 'Das']

For the second string:

In [110]:  ser.str.split()[1]

Out[110]: ['Data', 'Science']

Contains Function

In [126]:  ser = pd.Series(["Eshant Das","Data@Science","Geeks for Geeks",'Hello@World','Machine Learning'])



ser.str.contains('@')

Out[126]: 0 False
1 True
2 False
3 True
4 False
dtype: bool

Replace Function


In [127]:  ser.str.replace('@',' ')

Out[127]: 0 Eshant Das


1 Data Science
2 Geeks for Geeks
3 Hello World
4 Machine Learning
dtype: object

Count Function

In [128]:  ser.str.count('a')

Out[128]: 0 2
1 2
2 0
3 0
4 2
dtype: int64

startswith and endswith

In [129]:  ser.str.startswith('D')

Out[129]: 0 False
1 True
2 False
3 False
4 False
dtype: bool

In [130]:  ser.str.endswith('s')

Out[130]: 0 True
1 False
2 True
3 False
4 False
dtype: bool

Find Function

In [133]:  ser.str.find('Geeks')

Out[133]: 0 -1
1 -1
2 0
3 -1
4 -1
dtype: int64


h) Converting a Series to List

Pandas tolist() is used to convert a Series to a list. Initially the Series is of type
pandas.core.series.Series.

In [137]:  ser.to_list()

Out[137]: ['Eshant Das',


'Data@Science',
'Geeks for Geeks',
'Hello@World',
'Machine Learning']

2. Detailed Coding Implementations on Pandas DataFrame

Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). A DataFrame is a two-dimensional data
structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame
consists of three principal components: the data, rows, and columns.

a) Creating Data Frames

In the real world, a Pandas DataFrame will be created by loading datasets from existing
storage; storage can be a SQL database, CSV file, or Excel file. A Pandas DataFrame can also be
created from lists, dictionaries, a list of dictionaries, etc. A DataFrame can be created
in different ways; here are some ways by which we create a DataFrame:

Creating a dataframe using List:



DataFrame can be created using a single list or a list of lists.

In [161]:  lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']



pd.DataFrame(lst)

Out[161]: 0

0 Geeks

1 For

2 Geeks

3 is

4 portal

5 for

6 Geeks

In [163]:  lst = [['tom',10],['jerry',12],['spike',14]]



pd.DataFrame(lst)

Out[163]: 0 1

0 tom 10

1 jerry 12

2 spike 14

Creating DataFrame from dict of ndarray/lists:

To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length.
If an index is passed, then the length of the index should be equal to the length of the arrays.
If no index is passed, then by default the index will be range(n), where n is the array length.

In [166]:  data = {'name':['Tom', 'nick', 'krish', 'jack'], 'age':[20, 21, 19, 18]}


pd.DataFrame(data)

Out[166]: name age

0 Tom 20

1 nick 21

2 krish 19

3 jack 18

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular
fashion in rows and columns. We can perform basic operations on rows/columns like
selecting, deleting, adding, and renaming.

Column Selection: In order to select a column in a Pandas DataFrame, we can access
the columns by calling them by their column names.


In [169]:  data = { 'Name' :['Jai', 'Princi', 'Gaurav', 'Anuj'],


'Age' :[27, 24, 22, 32],
'Address' :['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

df = pd.DataFrame(data)

df[['Name', 'Qualification']]

Out[169]: Name Qualification

0 Jai Msc

1 Princi MA

2 Gaurav MCA

3 Anuj Phd

b) Slicing in DataFrames Using iloc and loc

Pandas comprises many methods for its proper functioning. loc[] and iloc[] are two of those
methods. These are used for slicing data from a Pandas DataFrame. They help in the
convenient selection of data from the DataFrame in Python, and they are used for filtering
the data according to some conditions.

In [171]:  data = {'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)
df

Out[171]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

Basic loc Operations

The loc[] indexer is a label-based data selecting method, which means that we have to pass
the name of the row or column that we want to select. This method includes the last element
of the range passed to it, unlike iloc[]. loc[] can accept boolean data, unlike iloc[].
Many operations can be performed using the loc[] method.


In [180]:  df.loc[1:2, 'two' : 'three']

Out[180]: two three

1 20 200

2 30 300

Basic iloc Operations

The iloc[] indexer is an index-based selecting method, which means that we have to pass
an integer index to the method to select a specific row/column. This method does not
include the last element of the range passed to it, unlike loc[]. iloc[] does not accept
boolean data, unlike loc[].

In [192]:  df.iloc[1 : -1, 1:-1 ]

Out[192]: two three

1 20 200

2 30 300

You can see that position 3 of both the rows and the columns has not been included here,
so the start (1) was inclusive but the end is exclusive in the case of iloc.
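The inclusive/exclusive contrast between loc and iloc can be checked side by side; a minimal sketch on a reduced version of the notebook's df:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]})

# loc slices by label and INCLUDES the end label 2
loc_slice = df.loc[1:2, 'two']
print(loc_slice.tolist())   # [20, 30]

# iloc slices by position and EXCLUDES the end position 2
iloc_slice = df.iloc[1:2, 1]
print(iloc_slice.tolist())  # [20]
```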

Let's see another example

In [195]:  df.iloc[:,2:3]

Out[195]: three

0 100

1 200

2 300

3 400

Selecting Specific Rows

In [197]:  df.iloc[[0,2],[1,3]]

Out[197]: two four

0 10 1000

2 30 3000

Slicing Using Conditions

Using conditions basically works with loc:


In [204]:  df.loc[df['two'] > 20, ['three','four']]

Out[204]: three four

2 300 3000

3 400 4000

So we could extract only those rows for which the value of 'two' is more than 20.
For the columns, we have used a comma (,) to extract specific columns, which are 'three' and
'four'.

Let's see another example

In [207]:  df.loc[df['three'] < 300, ['one','four']]

Out[207]: one four

0 1 1000

1 2 2000

You can interpret this code in the same way as the previous one.
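Multiple conditions can be combined inside loc with & (and) and | (or); each condition must be wrapped in parentheses. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40],
                   'three': [100, 200, 300, 400],
                   'four': [1000, 2000, 3000, 4000]})

# rows where 'two' > 10 AND 'three' < 400, keeping two columns
out = df.loc[(df['two'] > 10) & (df['three'] < 400), ['one', 'four']]
print(out)
```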

c) Column Addition in DataFrame

In [208]:  df

Out[208]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

We can add a column in many ways. Let us discuss three ways to add a column here:

Using List
Using Pandas Series
Using an existing column (we can modify that column in the way we want, and the
modified result can be stored as a new column)


In [210]:  l = [22,33,44,55]
df['five'] = l
df

Out[210]: one two three four five

0 1 10 100 1000 22

1 2 20 200 2000 33

2 3 30 300 3000 44

3 4 40 400 4000 55

In [211]:  sr = pd.Series([111,222,333,444])
df['six'] = sr
df

Out[211]: one two three four five six

0 1 10 100 1000 22 111

1 2 20 200 2000 33 222

2 3 30 300 3000 44 333

3 4 40 400 4000 55 444

Using an existing Column

In [216]:  df['seven'] = df['one'] + 10


df

Out[216]: one two three four five six seven

0 1 10 100 1000 22 111 11

1 2 20 200 2000 33 222 12

2 3 30 300 3000 44 333 13

3 4 40 400 4000 55 444 14

Now we can see that column 'seven' has all the values of column 'one' incremented by 10

d) Column Deletion in DataFrames

In [217]:  df

Out[217]: one two three four five six seven

0 1 10 100 1000 22 111 11

1 2 20 200 2000 33 222 12

2 3 30 300 3000 44 333 13

3 4 40 400 4000 55 444 14

Using del


You can see that the column which had the name 'six' has been deleted

In [218]:  del df['six']



df

Out[218]: one two three four five seven

0 1 10 100 1000 22 11

1 2 20 200 2000 33 12

2 3 30 300 3000 44 13

3 4 40 400 4000 55 14

Using pop

You can see that the column 'five' has also been deleted from our DataFrame

In [220]:  df.pop('five')

df

Out[220]: one two three four seven

0 1 10 100 1000 11

1 2 20 200 2000 12

2 3 30 300 3000 13

3 4 40 400 4000 14

e) Addition of rows

In a Pandas DataFrame, you can add rows by using the append method (deprecated in pandas 1.4
and removed in 2.0 in favor of pd.concat). You can create a new DataFrame with the desired
row values and use append to add the new rows to the original DataFrame. Here's an example
of adding rows to a DataFrame:

In [228]:  df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])


df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])


df3 = df1.append(df2).reset_index(drop = True)

df3

Out[228]: a b

0 1 2

1 3 4

2 5 6

3 7 8
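DataFrame.append(), used above, was removed in pandas 2.0; pd.concat() gives the same result for row addition. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# stack the frames row-wise and renumber the index from 0
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
```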


f) Pandas drop function

Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.

Pandas provide data analysts a way to delete and filter data frame using .drop() method.
Rows or columns can be removed using index label or column name using this method.

Syntax: DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None,
inplace=False, errors='raise')

Parameters:

labels: String or list of strings referring to row or column names.
axis: int or string value; 0/'index' for rows and 1/'columns' for columns.
index or columns: Single label or list. index/columns are an alternative to axis and cannot be used together with it.
level: Used to specify the level in case the DataFrame has a multi-level index.
inplace: Makes changes in the original DataFrame if True.
errors: Ignores errors if any value from the list doesn't exist, and drops the rest of the values, when errors='ignore'.

Return type: Dataframe with dropped values

In [240]:  data = { 'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}


df = pd.DataFrame(data)
df

Out[240]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

axis =0 => Rows (row wise)

In [241]:  df.drop([0,1], axis = 0, inplace = True)


df

Out[241]: one two three four

2 3 30 300 3000

3 4 40 400 4000

axis =1 => Columns (column wise)


In [242]:  df.drop(['one','three'], axis = 1, inplace = True)


df

Out[242]: two four

2 30 3000

3 40 4000
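The same drops can be written with the index= and columns= keywords instead of labels plus axis, which often reads more directly; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [10, 20], 'three': [100, 200]})

# columns=[...] is equivalent to labels=[...] with axis=1
out = df.drop(columns=['one', 'three'])
print(out.columns.tolist())
```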

g) Transposing a DataFrame

The .T attribute in a Pandas DataFrame is used to transpose the dataframe, i.e., to flip the
rows and columns. The result of transposing a dataframe is a new dataframe with the
original rows as columns and the original columns as rows.

Here's an example to illustrate the use of the .T attribute:

In [243]:  data = { 'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)
df

Out[243]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

In [244]:  df.T

Out[244]: 0 1 2 3

one 1 2 3 4

two 10 20 30 40

three 100 200 300 400

four 1000 2000 3000 4000


h) A set of more DataFrame Functionalities

In [245]:  df

Out[245]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

1. axes function

The .axes attribute in a Pandas DataFrame returns a list with the row and column labels of
the DataFrame. The first element of the list is the row labels (index), and the second
element is the column labels.

In [246]:  df.axes

Out[246]: [RangeIndex(start=0, stop=4, step=1),


Index(['one', 'two', 'three', 'four'], dtype='object')]

2. ndim function

The .ndim attribute in a Pandas DataFrame returns the number of dimensions of the
dataframe, which is always 2 for a DataFrame (row-and-column format).

In [247]:  df.ndim

Out[247]: 2

3. dtypes

The .dtypes attribute in a Pandas DataFrame returns the data types of the columns in the
DataFrame. The result is a Series with the column names as index and the data types of the
columns as values.

In [248]:  df.dtypes

Out[248]: one int64


two int64
three int64
four int64
dtype: object

4. shape function


The .shape attribute in a Pandas DataFrame returns the dimensions (number of rows,
number of columns) of the DataFrame as a tuple.

In [249]:  df.shape

Out[249]: (4, 4)

4 rows
4 columns

5. head() function

In [250]:  d = { 'Name' :pd.Series(['Tom','Jerry','Spike','Popeye','Olive','Bluto','Mickey']),
'Age' :pd.Series([10,12,14,30,28,33,15]),
'Height':pd.Series([3.25,1.11,4.12,5.47,6.15,6.67,2.61])}

df = pd.DataFrame(d)
df

Out[250]: Name Age Height

0 Tom 10 3.25

1 Jerry 12 1.11

2 Spike 14 4.12

3 Popeye 30 5.47

4 Olive 28 6.15

5 Bluto 33 6.67

6 Mickey 15 2.61

The .head() method in a Pandas DataFrame returns the first n rows (by default, n=5) of the
DataFrame. This method is useful for quickly examining the first few rows of a large
DataFrame to get a sense of its structure and content.

In [259]:  df.head(3)

Out[259]: Name Age Height

0 Tom 10 3.25

1 Jerry 12 1.11

2 Spike 14 4.12

By default it will display the first 5 rows.

We can mention the number of starting rows we want to see.
We will use this function more often further on; since the DataFrame is so small at this point,
we cannot use something like df.head(20).

6. df.tail() function


The .tail() method in a Pandas DataFrame returns the last n rows (by default, n=5) of the
DataFrame. This method is useful for quickly examining the last few rows of a large
DataFrame to get a sense of its structure and content.

In [260]:  df.tail(3)

Out[260]: Name Age Height

4 Olive 28 6.15

5 Bluto 33 6.67

6 Mickey 15 2.61

7. empty function

The .empty attribute in a Pandas DataFrame returns a Boolean value indicating whether the
DataFrame is empty or not. A DataFrame is considered empty if it has no rows.

In [263]:  df = pd.DataFrame()

df.empty

Out[263]: True

i) Statistical or Mathematical Functions

Sum
Mean
Median
Mode
Variance
Min
Max
Standard Deviation

In [264]:  data = {'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)
df

Out[264]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

1. Sum


In [266]:  df.sum()

Out[266]: one 10
two 100
three 1000
four 10000
dtype: int64

2. Mean

In [267]:  df.mean()

Out[267]: one 2.5


two 25.0
three 250.0
four 2500.0
dtype: float64

3. Median

In [269]:  df.median()

Out[269]: one 2.5


two 25.0
three 250.0
four 2500.0
dtype: float64

4. Mode

In [277]:  de = pd.DataFrame({'A': [1, 2, 2, 3, 4, 4, 4, 5], 'B': [10, 20, 20, 30,



print('A' , de['A'].mode())
print('B' , de['B'].mode())

A 0 4
dtype: int64
B 0 20
1 40
dtype: int64

5. Variance

In [279]:  df.var()

Out[279]: one 1.666667e+00


two 1.666667e+02
three 1.666667e+04
four 1.666667e+06
dtype: float64

6. Min


In [280]:  df.min()

Out[280]: one 1
two 10
three 100
four 1000
dtype: int64

7. Max

In [281]:  df.max()

Out[281]: one 4
two 40
three 400
four 4000
dtype: int64

8. Standard Deviation

In [282]:  df.std()

Out[282]: one 1.290994


two 12.909944
three 129.099445
four 1290.994449
dtype: float64
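All of these reductions run column-wise by default (axis=0); passing axis=1 computes the same statistic across each row instead. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]})

# sum across the columns of each row
row_sums = df.sum(axis=1)
print(row_sums.tolist())
```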

j) Describe Function

The describe() method in a Pandas DataFrame returns descriptive statistics of the data in
the DataFrame. It provides a quick summary of the central tendency, dispersion, and shape
of the distribution of a set of numerical data.

The default behavior of describe() is to compute descriptive statistics for all numerical
columns in the DataFrame. If you want to compute descriptive statistics for a specific
column, you can pass the name of the column as an argument.


In [284]:  data = {'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three': pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000]),
'five' : pd.Series(['A','B','C','D'])}


df = pd.DataFrame(data)

df.describe()

Out[284]: one two three four

count 4.000000 4.000000 4.000000 4.000000

mean 2.500000 25.000000 250.000000 2500.000000

std 1.290994 12.909944 129.099445 1290.994449

min 1.000000 10.000000 100.000000 1000.000000

25% 1.750000 17.500000 175.000000 1750.000000

50% 2.500000 25.000000 250.000000 2500.000000

75% 3.250000 32.500000 325.000000 3250.000000

max 4.000000 40.000000 400.000000 4000.000000
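Notice that the string column 'five' was silently excluded from the summary above. Passing include='object' (or include='all') brings non-numeric columns in, reporting count, unique, top, and freq instead of the numeric statistics; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4],
                   'five': ['A', 'B', 'C', 'D']})

# summarise only the object (string) columns
desc = df.describe(include='object')
print(desc)
```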

k) Pipe Functions

1. Pipe Function

The pipe() method in a Pandas DataFrame allows you to apply a function to the DataFrame,
similar to the way the apply() method works. The difference is that pipe() allows you to chain
multiple operations together by passing the output of one function to the input of the next
function.

In [286]:  data = {'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three': pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)
df

Out[286]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000

Example 1


In [291]:  def add_(i,j):


return i + j

df.pipe(add_, 10)

Out[291]: one two three four

0 11 20 110 1010

1 12 30 210 2010

2 13 40 310 3010

3 14 50 410 4010

Example 2

In [294]:  def mean_(col):


return col.mean()

def square(i):
return i ** 2

df.pipe(mean_).pipe(square)

Out[294]: one 6.25


two 625.00
three 62500.00
four 6250000.00
dtype: float64

2. Apply Function

The apply() method in a Pandas DataFrame allows you to apply a function to the
DataFrame, either to individual elements or to the entire DataFrame. The function can be
either a built-in Python function or a user-defined function.

In [295]:  data = {'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three': pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)
df

print(df.apply(np.mean))

Out[295]: one two three four

0 1 10 100 1000

1 2 20 200 2000

2 3 30 300 3000

3 4 40 400 4000


In [301]:  df.apply(lambda x: x.max() - x.min())

Out[301]: one 3
two 30
three 300
four 3000
dtype: int64

3. Applymap Function

The applymap() method in a Pandas DataFrame allows you to apply a function to every element
of the DataFrame. The function can be either a built-in Python function or a user-defined
function. (In pandas 2.1+, applymap has been renamed to DataFrame.map.)

In [303]:  df.applymap(lambda x : x*100)

Out[303]: one two three four

0 100 1000 10000 100000

1 200 2000 20000 200000

2 300 3000 30000 300000

3 400 4000 40000 400000

applymap and apply are both functions in the pandas library used for
applying a function to elements of a pandas DataFrame or Series.

applymap is used to apply a function to every element of a DataFrame. It


returns a new DataFrame where each element has been modified by the input
function.

apply is used to apply a function along any axis of a DataFrame or


Series. It returns either a Series or a DataFrame, depending on the axis
along which the function is applied and the return value of the function.
Unlike applymap, apply can take into account the context of the data, such
as the row or column label.

So, applymap is meant for element-wise operations while apply can be used
for both element-wise and row/column-wise operations.
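The row/column awareness of apply can be seen with axis=1, where the function receives one whole row (as a Series) at a time; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# axis=1: the lambda sees each row, so it can combine columns
combined = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(combined.tolist())
```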


In [312]:  df = pd.DataFrame({ 'A': [1.2, 3.4, 5.6],


'B': [7.8, 9.1, 2.3]})

df_1 = df.applymap(np.int64)
print(df_1)

df_2 = df.apply(lambda col : col.mean(), axis = 0)
print(df_2)

A B
0 1 7
1 3 9
2 5 2
A 3.4
B 6.4
dtype: float64

l) Reindex Function

The reindex function in Pandas is used to change the row labels and/or column labels of a
DataFrame. This function can be used to align data from multiple DataFrames or to update
the labels based on new data. The function takes in a list or an array of new labels as its first
argument and, optionally, a fill value to replace any missing values. The reindexing can be
done along either the row axis (0) or the column axis (1). The reindexed DataFrame is
returned.

Example 1 - Rows

In [333]:  data = { 'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)

print(df)
print('-'*30)
print(df.reindex([1,0,3,2]))

one two three four


0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
------------------------------
one two three four
1 2 20 200 2000
0 1 10 100 1000
3 4 40 400 4000
2 3 30 300 3000

Example 2 - Columns


In [336]:  data = {'Name' : ['John', 'Jane', 'Jim', 'Joan'],


'Age' : [25, 30, 35, 40],
'City' : ['New York', 'Los Angeles', 'Chicago', 'Houston']}

df = pd.DataFrame(data)

df.reindex(columns = ['Name','City','Age'])

Out[336]:
   Name         City  Age
0  John     New York   25
1  Jane  Los Angeles   30
2   Jim      Chicago   35
3  Joan      Houston   40
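The fill value mentioned in the description above comes into play when reindexing introduces labels that do not exist yet. A minimal sketch on made-up data (the label 5 is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [10, 20, 30]})

# Label 5 is not in the original index; without fill_value its row would be NaN
re = df.reindex([0, 2, 5], fill_value=0)
print(re)
```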

m) Renaming Columns in Pandas DataFrame

The rename function in Pandas is used to change the row labels and/or column labels of a
DataFrame. It can be used to update the names of one or multiple rows or columns by
passing a dictionary of new names as its argument. The dictionary should have the old
names as keys and the new names as values

In [343]:  data = { 'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)

df.rename(columns = {'one' : 'One','two': 'Two', 'three' : 'Three', 'fou
inplace = True, index = {0:'a',1:'b',2:'c',4:'d'})
df

Out[343]:
   One  Two  Three  Four
a    1   10    100  1000
b    2   20    200  2000
c    3   30    300  3000
3    4   40    400  4000

Note that row 3 keeps its numeric label: the index mapping targets label 4, which does not exist, and rename silently ignores labels it cannot find.
n) Sorting in Pandas DataFrame

Pandas provides several methods to sort a DataFrame based on one or more columns.

sort_values: This method sorts the DataFrame based on one or more columns. The default sorting order is ascending, but you can change it to descending by passing the ascending argument with a value of False.


In [355]:  data = { 'one' : pd.Series([11, 51, 31, 41]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 500, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)
df

Out[355]:
   one  two  three  four
0   11   10    100  1000
1   51   20    200  2000
2   31   30    500  3000
3   41   40    400  4000

Sort with respect to a Specific Column

In [356]:  df.sort_values(by = 'one')

Out[356]:
   one  two  three  four
0   11   10    100  1000
2   31   30    500  3000
3   41   40    400  4000
1   51   20    200  2000

Sort in a Specific Order (Descending)

In [357]:  df.sort_values(by = 'one', ascending = False)

Out[357]:
   one  two  three  four
1   51   20    200  2000
3   41   40    400  4000
2   31   30    500  3000
0   11   10    100  1000

Sort based on Multiple Columns

In [359]:  df.sort_values(by = ['one','two'])

Out[359]:
   one  two  three  four
0   11   10    100  1000
2   31   30    500  3000
3   41   40    400  4000
1   51   20    200  2000

Sort with a Specific Sorting Algorithm:


quicksort
mergesort
heapsort

In [361]:  df.sort_values(by = ['one'], kind = 'healsort')

Out[361]:
   one  two  three  four
0   11   10    100  1000
2   31   30    500  3000
3   41   40    400  4000
1   51   20    200  2000

o) Groupby Functions

The groupby function in pandas is used to split a dataframe into groups based on one or
more columns. It returns a DataFrameGroupBy object, which is similar to a DataFrame but
has some additional methods to perform operations on the grouped data.

In [362]:  cricket = {'Team' : ['India', 'India', 'Australia', 'Australia', 'SA',


'Rank' : [2, 3, 1,2, 3,4 ,1 ,1,2 , 4,1,2],
'Year' : [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014
'Points' : [876,801,891,815,776,784,834,824,758,691,883,782]}

df = pd.DataFrame(cricket)
df

Out[362]:
         Team  Rank  Year  Points
0       India     2  2014     876
1       India     3  2015     801
2   Australia     1  2014     891
3   Australia     2  2015     815
4          SA     3  2014     776
5          SA     4  2015     784
6          SA     1  2016     834
7          SA     1  2017     824
8          NZ     2  2016     758
9          NZ     4  2014     691
10         NZ     1  2015     883
11      India     2  2017     782

In [365]:  df.groupby('Team').groups

Out[365]: {'Australia': [2, 3], 'India': [0, 1, 11], 'NZ': [8, 9, 10], 'SA': [4,
5, 6, 7]}

Australia is present at indices 2 and 3; India at indices 0, 1, and 11; and so on.

To fetch the rows for a specific country in a specific year:

In [366]:  df.groupby(['Team','Year']).get_group(('Australia',2014))

Out[366]:
        Team  Rank  Year  Points
2  Australia     1  2014     891

If the requested group is not present, a KeyError will be raised.
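That error can be caught with a try/except if lookups may miss. A small sketch on a toy frame (not the full cricket data):

```python
import pandas as pd

df = pd.DataFrame({'Team': ['India', 'Australia'],
                   'Year': [2014, 2014],
                   'Points': [876, 891]})
grouped = df.groupby(['Team', 'Year'])

try:
    grouped.get_group(('SA', 2014))   # this combination is absent here
    found = True
except KeyError:
    found = False
print(found)
```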

Adding some statistical computation on top of groupby

In [374]:  df.groupby('Team').sum()['Points']

Out[374]: Team
Australia 1706
India 2459
NZ 2332
SA 3218
Name: Points, dtype: int64

This gives the total Points scored by each team.

Let us sort it to read it more easily.

In [377]:  df.groupby('Team').sum()['Points'].sort_values(ascending = False)

Out[377]: Team
SA 3218
India 2459
NZ 2332
Australia 1706
Name: Points, dtype: int64

Checking multiple stats for points team wise

In [382]:  groups = df.groupby('Team')



groups['Points'].agg([np.sum, np.mean, np.std,np.max,np.min])

Out[382]:
            sum        mean        std  amax  amin
Team
Australia  1706  853.000000  53.740115   891   815
India      2459  819.666667  49.702448   876   782
NZ         2332  777.333333  97.449132   883   691
SA         3218  804.500000  28.769196   834   776
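Passing NumPy functions directly works, but the same statistics can also be requested by their string names, which newer pandas versions prefer; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Points': [10, 20, 30, 50]})

# string names instead of np.sum, np.mean, ...
stats = df.groupby('Team')['Points'].agg(['sum', 'mean', 'max', 'min'])
print(stats)
```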


filter function along with groupby

In [386]:  df.groupby('Team').filter(lambda x : len(x) == 4)

Out[386]:
  Team  Rank  Year  Points
4   SA     3  2014     776
5   SA     4  2015     784
6   SA     1  2016     834
7   SA     1  2017     824

South Africa appears exactly 4 times, which is why only its rows are displayed here.

In [388]:  df.groupby('Team').filter(lambda x : len(x) == 3)

Out[388]:
     Team  Rank  Year  Points
0   India     2  2014     876
1   India     3  2015     801
8      NZ     2  2016     758
9      NZ     4  2014     691
10     NZ     1  2015     883
11  India     2  2017     782

India and New Zealand each appear 3 times, which is why their rows are displayed here.

3. Working with CSV files and basic Data Analysis using Pandas

a) Reading csv

Reading csv files from local system


In [398]:  df = pd.read_csv('Football.csv')

df.head()

Out[398]:
  Country   League   Club       Player Names  Matches_Played  Substitution  Mins  Goals     xG  ...
0   Spain  La Liga  (BET)    Juanmi Callejon              19            16  1849     11   6.62  ...
1   Spain  La Liga  (BAR)  Antoine Griezmann              36             0  3129     16  11.86  ...
2   Spain  La Liga  (ATL)        Luis Suarez              34             1  2940     28  23.21  ...
3   Spain  La Liga  (CAR)       Ruben Castro              32             3  2842     13  14.06  ...
4   Spain  La Liga  (VAL)      Kevin Gameiro              21            10  1745     13  10.65  ...

Reading CSV files from github repositories


NOTE: The link of the page should be copied when the file is in raw format

In [391]:  link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-w



# df = pd.read_csv(link)
# df.head()

Out[391]:
                                                 App        Category  Rating  Reviews  Size     Installs  Type  Price  Content Rating  ...
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1      159   19M      10,000+  Free      0        Everyone  ...
1                                Coloring book moana  ART_AND_DESIGN     3.9      967   14M     500,000+  Free      0        Everyone  ...
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7    87510  8.7M   5,000,000+  Free      0        Everyone  ...
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   215644   25M  50,000,000+  Free      0            Teen  ...
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3      967  2.8M     100,000+  Free      0        Everyone  ...


b) Pandas Info Function

Pandas dataframe.info() function is used to get a concise summary of the dataframe. It comes in really handy when doing exploratory analysis of the data. To get a quick overview of the dataset we use the dataframe.info() function.

Syntax: DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)

In [399]:  df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 660 non-null object
1 League 660 non-null object
2 Club 660 non-null object
3 Player Names 660 non-null object
4 Matches_Played 660 non-null int64
5 Substitution 660 non-null int64
6 Mins 660 non-null int64
7 Goals 660 non-null int64
8 xG 660 non-null float64
9 xG Per Avg Match 660 non-null float64
10 Shots 660 non-null int64
11 OnTarget 660 non-null int64
12 Shots Per Avg Match 660 non-null float64
13 On Target Per Avg Match 660 non-null float64
14 Year 660 non-null int64
dtypes: float64(4), int64(7), object(4)
memory usage: 77.5+ KB


c) isnull() function to check if there are NaN values present

In [400]:  df.isnull()

Out[400]:
     Country  League   Club  Player Names  Matches_Played  Substitution   Mins  Goals     xG  ...
0      False   False  False         False           False         False  False  False  False  ...
1      False   False  False         False           False         False  False  False  False  ...
2      False   False  False         False           False         False  False  False  False  ...
3      False   False  False         False           False         False  False  False  False  ...
4      False   False  False         False           False         False  False  False  False  ...
..       ...     ...    ...           ...             ...           ...    ...    ...    ...  ...
655    False   False  False         False           False         False  False  False  False  ...
656    False   False  False         False           False         False  False  False  False  ...
657    False   False  False         False           False         False  False  False  False  ...
658    False   False  False         False           False         False  False  False  False  ...
659    False   False  False         False           False         False  False  False  False  ...

660 rows × 15 columns

So we get a boolean table of the same shape, where True marks a null entry.

If we chain the sum function onto it, we get the number of null values in each column.

In [401]:  df.isnull().sum()

Out[401]: Country 0
League 0
Club 0
Player Names 0
Matches_Played 0
Substitution 0
Mins 0
Goals 0
xG 0
xG Per Avg Match 0
Shots 0
OnTarget 0
Shots Per Avg Match 0
On Target Per Avg Match 0
Year 0
dtype: int64


d) Quantile function to get a specific percentile value

Let us first check the 80th percentile value of each column using the describe function.

In [404]:  df.describe(percentiles = [.80])

Out[404]:
       Matches_Played  Substitution         Mins       Goals          xG  xG Per Avg Match  ...
count      660.000000    660.000000   660.000000  660.000000  660.000000        660.000000  ...
mean        22.371212      3.224242  2071.416667   11.810606   10.089606          0.476167  ...
std          9.754658      3.839498   900.595049    6.075315    5.724844          0.192831  ...
min          2.000000      0.000000   264.000000    2.000000    0.710000          0.070000  ...
50%         24.000000      2.000000  2245.500000   11.000000    9.285000          0.435000  ...
80%         32.000000      6.000000  2915.800000   15.000000   14.076000          0.610000  ...
max         38.000000     26.000000  4177.000000   42.000000   32.540000          1.350000  ...

So we can see the 80th Percentile value of Mins is 2915.80

Let us use the quantile function to get the exact value now

In [406]:  df['Mins'].quantile(.80)

Out[406]: 2915.8

Here we go we got the same value

To get the 99th percentile value we can write

In [407]:  df['Mins'].quantile(.99)

Out[407]: 3520.0199999999995

This function is important as it can be used to treat outliers in the Data Science EDA process.
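As an illustration of that idea (a sketch of one common approach, not the only way to treat outliers): clip values outside the 1st-99th percentile band, here on made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 1000])          # 1000 is an obvious outlier

lo, hi = s.quantile(0.01), s.quantile(0.99)    # percentile band
clipped = s.clip(lower=lo, upper=hi)           # cap anything outside the band
print(clipped.max())
```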

e) Copy function

If we simply assign de = df, both names refer to the same object, so a change through de affects df as well. To get an independent object that leaves the old dataframe untouched, we use the copy function.
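The difference can be demonstrated in a few lines; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

alias = df            # same underlying object: changes leak back into df
alias['b'] = [4, 5, 6]

copy_df = df.copy()   # independent copy: df stays untouched from here on
copy_df['c'] = [7, 8, 9]

print(list(df.columns))
```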


In [413]:  de = df.copy()
de.head(3)

Out[413]:
  Country   League   Club       Player Names  Matches_Played  Substitution  Mins  Goals     xG  ...
0   Spain  La Liga  (BET)    Juanmi Callejon              19            16  1849     11   6.62  ...
1   Spain  La Liga  (BAR)  Antoine Griezmann              36             0  3129     16  11.86  ...
2   Spain  La Liga  (ATL)        Luis Suarez              34             1  2940     28  23.21  ...

In [414]:  de['Year+100'] = de['Year'] + 100


de.head()

Out[414]:
  Country   League   Club       Player Names  Matches_Played  Substitution  Mins  Goals     xG  ...
0   Spain  La Liga  (BET)    Juanmi Callejon              19            16  1849     11   6.62  ...
1   Spain  La Liga  (BAR)  Antoine Griezmann              36             0  3129     16  11.86  ...
2   Spain  La Liga  (ATL)        Luis Suarez              34             1  2940     28  23.21  ...
3   Spain  La Liga  (CAR)       Ruben Castro              32             3  2842     13  14.06  ...
4   Spain  La Liga  (VAL)      Kevin Gameiro              21            10  1745     13  10.65  ...

So a new Year+100 column has been added to de (beyond the columns printed above), but our old data is secure


In [415]:  df.head()

Out[415]:
  Country   League   Club       Player Names  Matches_Played  Substitution  Mins  Goals     xG  ...
0   Spain  La Liga  (BET)    Juanmi Callejon              19            16  1849     11   6.62  ...
1   Spain  La Liga  (BAR)  Antoine Griezmann              36             0  3129     16  11.86  ...
2   Spain  La Liga  (ATL)        Luis Suarez              34             1  2940     28  23.21  ...
3   Spain  La Liga  (CAR)       Ruben Castro              32             3  2842     13  14.06  ...
4   Spain  La Liga  (VAL)      Kevin Gameiro              21            10  1745     13  10.65  ...

The new column is not present here

f) Value Counts function

Pandas Series.value_counts() function returns a Series containing counts of unique values. The resulting object is in descending order so that the first element is the most frequently-occurring element. It excludes NA values by default.

Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

In [417]:  df['Player Names'].value_counts()

Out[417]: Andrea Belotti     5
          Lionel Messi       5
          Luis Suarez        5
          Andrej Kramaric    5
          Ciro Immobile      5
                            ..
          Francois Kamano    1
          Lebo Mothiba       1
          Gaetan Laborde     1
          Falcao             1
          Cody Gakpo         1
          Name: Player Names, Length: 444, dtype: int64
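With normalize=True (from the syntax above) the same call returns proportions instead of raw counts; a sketch:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'a'])

props = s.value_counts(normalize=True)   # fractions that sum to 1
print(props)
```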

g) Unique and Nunique Function

While analyzing the data, many times the user wants to see the unique values in a particular
column, which can be done using Pandas unique() function.


In [418]:  df['Player Names'].unique()

Out[418]: array(['Juanmi Callejon', 'Antoine Griezmann', 'Luis Suarez',
                 'Ruben Castro', 'Kevin Gameiro', 'Cristiano Ronaldo',
                 'Karim Benzema', 'Neymar ', 'Iago Aspas', 'Sergi Enrich',
                 'Aduriz ', 'Sandro Ramlrez', 'Lionel Messi', 'Gerard Moreno',
                 'Morata', 'Wissam Ben Yedder', 'Willian Jose', 'Andone ',
                 'Cedric Bakambu', 'Isco', 'Mohamed Salah', 'Gregoire Defrel',
                 'Ciro Immobile', 'Nikola Kalinic', 'Dries Mertens',
                 'Alejandro Gomez', 'Jose CallejOn', 'Iago Falque',
                 'Giovanni Simeone', 'Mauro Icardi', 'Diego Falcinelli',
                 'Cyril Thereau', 'Edin Dzeko', 'Lorenzo Insigne',
                 'Fabio Quagliarella', 'Borriello ', 'Carlos Bacca',
                 'Gonzalo Higuain', 'Keita Balde', 'Andrea Belotti', 'Fin Bartels',
                 'Lars Stindl', 'Serge Gnabry', 'Wagner ', 'Andrej Kramaric',
                 'Florian Niederlechner', 'Robert Lewandowski', 'Emil Forsberg',
                 'Timo Werner', 'Nils Petersen', 'Vedad Ibisevic', 'Mario Gomez',
                 'Maximilian Philipp', 'A\x81dam Szalai', ...], dtype=object)
While analyzing the data, many times the user wants to see the unique values in a particular
column. Pandas nunique() is used to get a count of unique values.

In [419]:  df['Player Names'].nunique()

Out[419]: 444

h) dropna() function

Sometimes a csv file has null values, which are later displayed as NaN in the DataFrame. The Pandas dropna() method allows the user to analyze and drop rows/columns with null values in different ways.

Syntax:

DataFrameName.dropna(axis=0,inplace=False)

axis: axis takes int or string value for rows/columns. Input can be 0 or 1 for Integer
and ‘index’ or ‘columns’ for String.
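Besides axis, dropna also accepts how ('any' vs 'all') and subset (restrict the null check to certain columns); a sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7, 8, 9]})

only_a = df.dropna(subset=['A'])   # drop rows where column A is null
any_row = df.dropna(how='all')     # drop rows only if *every* value is null
print(len(only_a), len(any_row))
```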


In [422]:  link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-w



df = pd.read_csv(link)
df.head()

Out[422]:
                                                 App        Category  Rating  Reviews  Size     Installs  Type  Price  Content Rating  ...
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1      159   19M      10,000+  Free      0        Everyone  ...
1                                Coloring book moana  ART_AND_DESIGN     3.9      967   14M     500,000+  Free      0        Everyone  ...
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7    87510  8.7M   5,000,000+  Free      0        Everyone  ...
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   215644   25M  50,000,000+  Free      0            Teen  ...
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3      967  2.8M     100,000+  Free      0        Everyone  ...

In [423]:  df.isnull().sum()

Out[423]: App 0
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64

So we have a lot of null values in the Rating column and a few in some other columns.

In [426]:  df.dropna(inplace = True, axis = 0)

This deletes every row that contains a null value.


In [427]:  df.dropna(inplace = True, axis = 1)

This will delete all the columns containing null values

i) Fillna Function

Pandas Series.fillna() function is used to fill NA/NaN values using the specified method.

If we want to fill the null values with something instead of removing them, we can use the fillna function. Here we will fill the numerical columns with their mean and the categorical columns with their mode.

In [447]:  link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-w



df = pd.read_csv(link)

print(len(df))

10841

Numerical columns

In [448]:  mis = round(df['Rating'].mean(),2)



df['Rating'] = df['Rating'].fillna(mis)

print(len(df))

10841

If we had used inplace = True, those values would have been stored permanently in our dataframe.

Categorical values

In [461]:  df['Current Ver'] = df['Current Ver'].fillna('Varies on Device')
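Instead of a hard-coded string, the mode itself can be computed with mode(), matching the "fill categorical columns with their mode" idea stated above; a sketch on toy data:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Free', 'Paid', 'Free', np.nan, 'Free'])

mode_value = s.mode()[0]          # most frequent category
filled = s.fillna(mode_value)
print(mode_value, filled.isnull().sum())
```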

j) sample function

Pandas sample() is used to generate a sample random row or column from the function
caller data frame.

Syntax:

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)


In [471]:  df.sample(5)

Out[471]:
                                     App            Category  Rating  Reviews                Size    Installs  Type  ...
9083                   Displaying You VR         PHOTOGRAPHY    4.19        1                 67M         50+  Free  ...
1547                        Eternal life  LIBRARIES_AND_DEMO    5.00       26                2.5M      1,000+  Free  ...
433                  Safest Call Blocker       COMMUNICATION    4.40    27540                3.7M  1,000,000+  Free  ...
7452  San Andreas Crime City Gangster 3D              FAMILY    4.20     9403  Varies with device  1,000,000+  Free  ...
190                 ADP Mobile Solutions            BUSINESS    4.30    85185                 29M  5,000,000+  Free  ...
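Two parameters from the syntax above are worth highlighting: frac samples a fraction of the rows instead of a fixed count, and random_state makes the draw reproducible; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

half = df.sample(frac=0.5, random_state=42)    # 50 random rows, reproducible
again = df.sample(frac=0.5, random_state=42)   # identical draw
print(len(half), half.equals(again))
```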

k) to_csv() function

Pandas to_csv() function writes the given object to a comma-separated values (csv) file/format.

Syntax: Series.to_csv(*args, **kwargs)

In [477]:  data = { 'one' : pd.Series([1, 2, 3, 4]),


'two' : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four' : pd.Series([1000, 2000, 3000, 4000])}

df = pd.DataFrame(data)

df.to_csv('Number.csv')

This writes the index as an extra column (it shows up as Unnamed: 0 when read back); to avoid that, pass index = False.

In [478]:  df.to_csv('Numbers.csv', index = False)

4. A detailed Pandas Profile report

The pandas_profiling library in Python includes a method named ProfileReport() which generates a basic report on the input DataFrame.

The report consists of the following:


a DataFrame overview, each attribute on which the DataFrame is defined, and correlations between attributes.
In [480]:  import matplotlib
import pandas_profiling as pp

In [484]:  df = pd.read_csv('Football.csv')
df.head()

Out[484]:
  Country   League   Club       Player Names  Matches_Played  Substitution  Mins  Goals     xG  ...
0   Spain  La Liga  (BET)    Juanmi Callejon              19            16  1849     11   6.62  ...
1   Spain  La Liga  (BAR)  Antoine Griezmann              36             0  3129     16  11.86  ...
2   Spain  La Liga  (ATL)        Luis Suarez              34             1  2940     28  23.21  ...
3   Spain  La Liga  (CAR)       Ruben Castro              32             3  2842     13  14.06  ...
4   Spain  La Liga  (VAL)      Kevin Gameiro              21            10  1745     13  10.65  ...

In [485]:  report = pp.ProfileReport(df)

In [486]:  report

Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]

Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]

Render HTML: 0%| | 0/1 [00:00<?, ?it/s]

Out[486]:

In [ ]:  ​
