Pandas - Jupyter Notebook
GeeksforGeeks
Pandas
In [6]: import numpy as np
import pandas as pd
Table of Contents
localhost:8888/notebooks/Desktop/Python/Pandas/Pandas.ipynb 1/44
2/13/24, 2:41 PM Pandas - Jupyter Notebook
3. Working with csv files and basic data Analysis Using Pandas
a) Creating Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called the index.
Labels need not be unique but must be of a hashable type. The object supports both integer
and label-based indexing and provides a host of methods for performing operations
involving the index.
In [ ]: lst = [1,2,3,4,5]
pd.Series(lst)
In [ ]: arr = np.array([1,2,3,4,5])
pd.Series(arr)
Out[12]: Eshant 1
Pranjal 2
Jayesh 3
Ashish 4
dtype: int64
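The cell that produced Out[12] is not preserved. A minimal sketch that would give the same result, assuming the labels were passed through the index argument (the variable names here are illustrative):

```python
import pandas as pd

# Labels taken from the output above; they become the index of the Series.
names = ["Eshant", "Pranjal", "Jayesh", "Ashish"]
s_named = pd.Series([1, 2, 3, 4], index=names)

# Elements can now be looked up by label.
jayesh_value = s_named["Jayesh"]
```

Passing a dict such as {"Eshant": 1, "Pranjal": 2} to pd.Series would be another way to get labeled values.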
The Pandas Series.repeat() function repeats the elements of a Series. It returns a new Series where
each element of the current Series is repeated consecutively a given number of times.
In [19]: pd.Series(5).repeat(3)
Out[19]: 0 5
0 5
0 5
dtype: int64
Out[27]: 0 5
1 5
2 5
dtype: int64
Out[29]: 0 10
1 10
2 10
3 10
4 10
5 20
6 20
dtype: int64
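The cells behind Out[27] and Out[29] are not preserved. The sketch below shows one way to get the same shapes of output, assuming reset_index was used to renumber the repeated labels; the original cells may have been written differently.

```python
import pandas as pd

repeated = pd.Series(5).repeat(3)              # index labels are 0, 0, 0
renumbered = repeated.reset_index(drop=True)   # index labels become 0, 1, 2

# repeat also accepts one count per element: 10 five times, 20 twice.
s = pd.Series([10, 20]).repeat([5, 2]).reset_index(drop=True)
```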
Accessing elements
In [34]: s[4]
Out[34]: 10
Accessing a label that is not in the index, such as s[50], does not work, because elements are
looked up by the index labels which we provided.
In [38]: s[6]
Out[38]: 20
In [49]: s[2:-2]
Out[49]: 2 10
3 10
4 10
dtype: int64
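Plain square brackets mix label lookup (s[4]) with positional slicing (s[2:-2]). A small sketch of the explicit alternatives, loc for labels and iloc for positions, using the values shown in Out[29]:

```python
import pandas as pd

s = pd.Series([10, 10, 10, 10, 10, 20, 20])

by_label = s.loc[4]        # label-based lookup
by_position = s.iloc[-1]   # position-based lookup, last element
middle = s.iloc[2:-2]      # positional slice, same rows as s[2:-2]
```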
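The Pandas Series.aggregate() function (alias agg) aggregates using one or more operations over the
specified axis in the given series object.
In [58]: sr = pd.Series([1,2,3,4,5,6,7])
# Function names passed as strings avoid the warning newer pandas raises for Python builtins
sr.agg(['min','max','sum'])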
Out[58]: min 1
max 7
sum 28
dtype: int64
Pandas Series.abs() method is used to get the absolute numeric value of each element in
Series/DataFrame.
In [60]: sr = pd.Series([1,-2,3,-4,5,-6,7])
sr.abs()
Out[60]: 0 1
1 2
2 3
3 4
4 5
5 6
6 7
dtype: int64
d) Appending Series
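Parameters:
to_append: Series or list/tuple of Series.
ignore_index: If True, do not use the index labels.
verify_integrity: If True, raise an exception on creating an index with duplicates.
Note: Series.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use pd.concat() instead.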
Out[67]: 0 1
1 2
2 3
0 1
1 -2
2 3
dtype: int64
Out[71]: 0 1
1 2
2 3
3 1
4 -2
5 3
dtype: int64
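The cells that produced Out[67] and Out[71] are not preserved. The following sketch reproduces the same two results with pd.concat, which also works on pandas 2.x where append is no longer available. The values of sr are assumed from the output; sr1 matches Out[75].

```python
import pandas as pd

sr = pd.Series([1, 2, 3])
sr1 = pd.Series([1, -2, 3])

# Keeps the original labels, so the index reads 0, 1, 2, 0, 1, 2.
kept = pd.concat([sr, sr1])

# ignore_index=True renumbers the result from 0 to 5.
renumbered = pd.concat([sr, sr1], ignore_index=True)
```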
e) Astype function
Pandas astype() is one of the most important methods. It is used to change the data type of
a series. When a data frame is made from a csv file, the columns are imported and their data types
are set automatically, which many times is not what they actually should be.
In [75]: sr1
Out[75]: 0 1
1 -2
2 3
dtype: int64
In [76]: type(sr1[0])
Out[76]: numpy.int64
In [80]: sr1.astype('float')
Out[80]: 0 1.0
1 -2.0
2 3.0
dtype: float64
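As a sketch of the csv situation described above, suppose a column arrived as strings (the raw values here are made up for illustration). astype can convert it step by step:

```python
import pandas as pd

# Hypothetical column that was read in as text instead of numbers.
raw = pd.Series(["1", "-2", "3"])

as_int = raw.astype("int64")      # text -> integers
as_float = as_int.astype("float") # integers -> floats, as in Out[80]
```

astype returns a new Series; the original is left unchanged unless the result is assigned back.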
f) Between Function
The Pandas between() method is used on a series to check which values lie between the first and
second argument. Both boundaries are included by default.
Out[86]: 0 1
1 2
2 30
3 4
4 5
5 6
6 7
7 8
8 9
9 20
dtype: int64
In [87]: sr1.between(10,50)
Out[87]: 0 False
1 False
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 True
dtype: bool
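The boolean result is usually used as a mask. A short sketch with the values from Out[86]; the inclusive="neither" form assumes pandas 1.3 or newer, where the parameter takes a string.

```python
import pandas as pd

sr1 = pd.Series([1, 2, 30, 4, 5, 6, 7, 8, 9, 20])

mask = sr1.between(10, 50)   # both ends included by default
selected = sr1[mask]         # keep only the values inside the range

# Excluding the boundaries (string values need pandas >= 1.3).
strict = pd.Series([10, 30, 50]).between(10, 50, inclusive="neither")
```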
In [88]: ser = pd.Series(["Eshant Das" , "Data Science" , "Geeks for Geeks" , 'He
In [92]: print(ser.str.upper())
print('-'*30)
print(ser.str.lower())
0 ESHANT DAS
1 DATA SCIENCE
2 GEEKS FOR GEEKS
3 HELLO WORLD
4 MACHINE LEARNING
dtype: object
------------------------------
0 eshant das
1 data science
2 geeks for geeks
3 hello world
4 machine learning
dtype: object
Length function
10
12
15
11
16
Strip Function
In [95]: ser = pd.Series([" Eshant Das" , "Data Science" , "Geeks for Geeks" , '
for i in ser:
print(i , len(i))
Eshant Das 12
Data Science 12
Geeks for Geeks 15
Hello World 11
Machine Learning 18
Eshant Das 10
Data Science 12
Geeks for Geeks 15
Hello World 11
Machine Learning 16
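The cells for the length and strip outputs are only partly preserved. A minimal sketch of the same idea with the vectorized str.len() and str.strip() methods; the padded strings below are assumptions chosen to match the lengths shown (12 and 18 before stripping, 10 and 16 after).

```python
import pandas as pd

# One space of padding on each side of the first and last string (assumed).
padded = pd.Series([" Eshant Das ", "Data Science", " Machine Learning "])

before = padded.str.len()      # lengths including the padding
stripped = padded.str.strip()  # remove leading and trailing whitespace
after = stripped.str.len()     # lengths without the padding
```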
Split Function
In [108]: ser.str.split()
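Indexing the result with [0] or [1] selects one element of the split Series by its index label, that is, the list of words of a single string. To get the first word of every string in the pandas series, use ser.str.split().str[0] instead.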
In [109]: ser.str.split()[0]
In [110]: ser.str.split()[1]
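The difference between indexing the split result and using the .str accessor again can be sketched as follows (a three-string Series is used to keep it short):

```python
import pandas as pd

ser = pd.Series(["Eshant Das", "Data Science", "Geeks for Geeks"])

words = ser.str.split()       # each element becomes a list of words
first_row_words = words[0]    # the list for the string with label 0
first_words = words.str[0]    # the first word of every string

# expand=True spreads the words over columns instead of lists.
columns = ser.str.split(expand=True)
```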
Contains Function
Out[126]: 0 False
1 True
2 False
3 True
4 False
dtype: bool
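The cell that produced Out[126] is not preserved, so the pattern it searched for is unknown. The sketch below only illustrates how str.contains behaves; the patterns "Geeks" and "data" are assumptions and do not reproduce the output above.

```python
import pandas as pd

ser = pd.Series(["Eshant Das", "Data Science", "Geeks for Geeks",
                 "Hello World", "Machine Learning"])

has_geeks = ser.str.contains("Geeks")              # case-sensitive by default
has_data_ci = ser.str.contains("data", case=False) # ignore upper/lower case
```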
Replace Function
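The replace cell is missing from this export. As a hedged sketch, str.replace substitutes a pattern in every string of the Series; the replacement text "Gfg" is made up for illustration.

```python
import pandas as pd

ser = pd.Series(["Eshant Das", "Data Science", "Geeks for Geeks"])

# regex=False treats the pattern as plain text rather than a regular expression.
replaced = ser.str.replace("Geeks", "Gfg", regex=False)
```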
Count Function
In [128]: ser.str.count('a')
Out[128]: 0 2
1 2
2 0
3 0
4 2
dtype: int64
In [129]: ser.str.startswith('D')
Out[129]: 0 False
1 True
2 False
3 False
4 False
dtype: bool
In [130]: ser.str.endswith('s')
Out[130]: 0 True
1 False
2 True
3 False
4 False
dtype: bool
Find Function
In [133]: ser.str.find('Geeks')
Out[133]: 0 -1
1 -1
2 0
3 -1
4 -1
dtype: int64
Pandas tolist() (also available as to_list()) is used to convert a series to a list. Initially the series is of type
pandas.core.series.Series.
In [137]: ser.to_list()
In the real world, a Pandas DataFrame is usually created by loading a dataset from existing
storage, which can be a SQL database, a CSV file, or an Excel file. A Pandas DataFrame can also be
created from lists, a dictionary, a list of dictionaries, etc. Here are some of the ways
in which we can create a dataframe:
Out[161]: 0
0 Geeks
1 For
2 Geeks
3 is
4 portal
5 for
6 Geeks
Out[163]: 0 1
0 tom 10
1 jerry 12
2 spike 14
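The cells behind Out[161] and Out[163] are not preserved. A sketch that would give the same tables, assuming they were built from a list and from a list of lists; the column names in the last line are an illustrative addition.

```python
import pandas as pd

# A flat list becomes a single column labeled 0.
words = ["Geeks", "For", "Geeks", "is", "portal", "for", "Geeks"]
df_words = pd.DataFrame(words)

# A list of lists becomes one row per inner list, columns labeled 0 and 1.
rows = [["tom", 10], ["jerry", 12], ["spike", 14]]
df_rows = pd.DataFrame(rows)

# Passing columns= gives the columns readable names (names assumed here).
df_named = pd.DataFrame(rows, columns=["Name", "Age"])
```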
To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index
is passed, then the length of the index should be equal to the length of the arrays. If no index is
passed, then by default the index will be range(n), where n is the array length.
In [166]: data = {'name':['Tom', 'nick', 'krish', 'jack'], 'age':[20, 21, 19, 18]}
pd.DataFrame(data)
    name  age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
Column Selection: In order to select a column in a Pandas DataFrame, we can access
the columns by their column names.
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd
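The selection cell is not preserved. A sketch that would produce a two-column result like the one above; the column names Name and Qualification and the Age values are assumptions, since the header row is missing from the output.

```python
import pandas as pd

data = {
    "Name": ["Jai", "Princi", "Gaurav", "Anuj"],
    "Age": [27, 24, 22, 32],                      # hypothetical values
    "Qualification": ["Msc", "MA", "MCA", "Phd"],
}
df = pd.DataFrame(data)

one_column = df["Name"]                      # a single name returns a Series
two_columns = df[["Name", "Qualification"]]  # a list of names returns a DataFrame
```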
Pandas offers several ways to select data, and loc and iloc are two of them. They are
indexers, used with square brackets (df.loc[...] and df.iloc[...]) rather than called like
functions. They are used for slicing data from a Pandas DataFrame and for filtering the
data according to some conditions.
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
The loc indexer: loc is a label-based data selection method, which means
that we have to pass the name of the row or column which we want to select. A slice passed
to loc includes its last element, unlike iloc. loc also accepts a boolean
Series or array as a mask. Many operations can be performed using loc.
1 20 200
2 30 300
The iloc indexer is an integer-position-based selection method, which means that we have to pass
an integer position to select a specific row/column. A slice passed to iloc does not
include its last element, unlike loc. iloc does not accept a boolean Series as a mask,
although a plain boolean list or array works.
1 20 200
2 30 300
You can see that position 3 of both rows and columns has not been included here, so 1 was
inclusive but 3 is exclusive in the case of iloc.
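The two selections that return rows 1 and 2 of columns two and three can be sketched side by side. The exact cells are not preserved, so the slices below are an assumption that matches the outputs shown.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

by_label = df.loc[1:2, "two":"three"]  # label slices include both ends
by_position = df.iloc[1:3, 1:3]        # position slices exclude the stop
picked = df.iloc[[0, 2], [1, 3]]       # explicit lists of positions
```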
In [195]: df.iloc[:,2:3]
Out[195]: three
0 100
1 200
2 300
3 400
In [197]: df.iloc[[0,2],[1,3]]
0 10 1000
2 30 3000
2 300 3000
3 400 4000
So we could extract only those rows for which the value is more than 20.
For the columns, we have used a comma (,) to extract specific columns, which are 'three' and
'four'.
0 1 1000
1 2 2000
So you can get the inference in the same way for this code as we got for the previous
code
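The filtering cells themselves are missing. A sketch of conditions that would give the two outputs above, assuming the comparison was made on column two (the original condition is not shown):

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

# Rows where 'two' is more than 20, restricted to columns 'three' and 'four'.
big = df.loc[df["two"] > 20, ["three", "four"]]

# Rows where 'two' is below 30, restricted to columns 'one' and 'four'.
small = df.loc[df["two"] < 30, ["one", "four"]]
```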
In [208]: df
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
We can add a column in many ways. Let us discuss three ways in which we can add a column
here:
Using a list
Using a Pandas Series
Using an existing column (we can modify that column in the way we want and store the
modified values as a new column)
In [210]: l = [22,33,44,55]
df['five'] = l
df
0 1 10 100 1000 22
1 2 20 200 2000 33
2 3 30 300 3000 44
3 4 40 400 4000 55
In [211]: sr = pd.Series([111,222,333,444])
df['six'] = sr
df
Now we can see that the seventh column holds all the values of column 'one' incremented by 10
In [217]: df
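The three ways can be put together in one sketch. The column name 'seven' is an assumption, since the cell that created the seventh column is not preserved.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

df["five"] = [22, 33, 44, 55]                  # from a list
df["six"] = pd.Series([111, 222, 333, 444])    # from a Series, aligned on the index
df["seven"] = df["one"] + 10                   # derived from an existing column
```

A list must have exactly as many items as the DataFrame has rows, while a Series is matched by index label and leaves NaN where a label is missing.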
Using del
You can see that the column which had the name 'six' has been deleted
0 1 10 100 1000 22 11
1 2 20 200 2000 33 12
2 3 30 300 3000 44 13
3 4 40 400 4000 55 14
Using pop
You can see that the column 'five' has also been deleted from our dataframe
In [220]: df.pop('five')
df
0 1 10 100 1000 11
1 2 20 200 2000 12
2 3 30 300 3000 13
3 4 40 400 4000 14
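Both deletions can be sketched in a few lines. The difference is that del simply removes the column, while pop removes it and hands the removed column back as a Series.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "five": [22, 33, 44, 55],
    "six": [111, 222, 333, 444],
})

del df["six"]            # removes the column in place, returns nothing
five = df.pop("five")    # removes the column and returns it
```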
e) Addition of rows
In older versions of pandas, you can add rows to a DataFrame by using the append method: create
a new DataFrame with the desired row values and use append to add the new
rows to the original dataframe. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use pd.concat() instead. Here's an example of adding rows to a dataframe:
Out[228]: a b
0 1 2
1 3 4
2 5 6
3 7 8
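The cell that produced Out[228] is not preserved. A sketch that reaches the same table with pd.concat, assuming the original frame held the first two rows and the last two were added:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})
new_rows = pd.DataFrame({"a": [5, 7], "b": [6, 8]})

# ignore_index=True renumbers the rows 0..3 instead of repeating 0, 1.
combined = pd.concat([df, new_rows], ignore_index=True)
```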
Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.
Pandas provides data analysts a way to delete and filter a data frame using the .drop() method.
Rows or columns can be removed by index label or column name using this method.
Parameters:
labels: String or list of strings referring to row or column names.
axis: int or string value; 0 or 'index' for rows and 1 or 'columns' for columns.
index or columns: Single label or list. These are an alternative to axis and cannot be used together with it.
level: Used to specify the level in case the data frame has a multi-level index.
inplace: Makes the changes in the original Data Frame if True.
errors: With errors='ignore', labels from the list that don't exist are ignored and the rest of
the values are dropped.
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
2 3 30 300 3000
3 4 40 400 4000
2 30 3000
3 40 4000
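The drop cells are missing from this export. The following sketch would give the same two outputs, assuming rows 0 and 1 were dropped first and then columns 'one' and 'three':

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

rows_dropped = df.drop([0, 1])                               # axis=0 is the default
cols_dropped = rows_dropped.drop(columns=["one", "three"])   # same as axis=1

# errors='ignore' skips labels that do not exist instead of raising.
unchanged = df.drop(columns=["missing"], errors="ignore")
```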
g) Transposing a DataFrame
The .T attribute in a Pandas DataFrame is used to transpose the dataframe, i.e., to flip the
rows and columns. The result of transposing a dataframe is a new dataframe with the
original rows as columns and the original columns as rows.
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
In [244]: df.T
Out[244]: 0 1 2 3
one 1 2 3 4
two 10 20 30 40
In [245]: df
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
1. axes attribute
The .axes attribute in a Pandas DataFrame returns a list with the row and column labels of
the DataFrame. The first element of the list is the row labels (index), and the second
element is the column labels.
In [246]: df.axes
2. ndim attribute
The .ndim attribute in a Pandas DataFrame returns the number of dimensions of the
dataframe, which is always 2 for a DataFrame (row-and-column format).
In [247]: df.ndim
Out[247]: 2
3. dtypes
The .dtypes attribute in a Pandas DataFrame returns the data types of the columns in the
DataFrame. The result is a Series with the column names as index and the data types of the
columns as values.
In [248]: df.dtypes
4. shape attribute
The .shape attribute in a Pandas DataFrame returns the dimensions (number of rows,
number of columns) of the DataFrame as a tuple.
In [249]: df.shape
Out[249]: (4, 4)
4 rows
4 columns
5. head() function
0 Tom 10 3.25
1 Jerry 12 1.11
2 Spike 14 4.12
3 Popeye 30 5.47
4 Olive 28 6.15
5 Bluto 33 6.67
6 Mickey 15 2.61
The .head() method in a Pandas DataFrame returns the first n rows (by default, n=5) of the
DataFrame. This method is useful for quickly examining the first few rows of a large
DataFrame to get a sense of its structure and content.
In [259]: df.head(3)
0 Tom 10 3.25
1 Jerry 12 1.11
2 Spike 14 4.12
6. tail() function
The .tail() method in a Pandas DataFrame returns the last n rows (by default, n=5) of the
DataFrame. This method is useful for quickly examining the last few rows of a large
DataFrame to get a sense of its structure and content.
In [260]: df.tail(3)
4 Olive 28 6.15
5 Bluto 33 6.67
6 Mickey 15 2.61
7. empty attribute
The .empty attribute in a Pandas DataFrame returns a Boolean value indicating whether the
DataFrame is empty or not. A DataFrame is considered empty if any of its axes has length 0, for example if it has no rows or no columns.
In [263]: df = pd.DataFrame()
df.empty
Out[263]: True
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
1. Sum
In [266]: df.sum()
Out[266]: one 10
two 100
three 1000
four 10000
dtype: int64
2. Mean
In [267]: df.mean()
3. Median
In [269]: df.median()
4. Mode
A 0 4
dtype: int64
B 0 20
1 40
dtype: int64
5. Variance
In [279]: df.var()
6. Min
In [280]: df.min()
Out[280]: one 1
two 10
three 100
four 1000
dtype: int64
7. Max
In [281]: df.max()
Out[281]: one 4
two 40
three 400
four 4000
dtype: int64
8. Standard Deviation
In [282]: df.std()
j) Describe Function
The describe() method in a Pandas DataFrame returns descriptive statistics of the data in
the DataFrame. It provides a quick summary of the central tendency, dispersion, and shape
of the distribution of a set of numerical data.
The default behavior of describe() is to compute descriptive statistics for all numerical
columns in the DataFrame. If you want descriptive statistics for a specific
column, select that column first and call describe() on it, for example df['one'].describe().
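The describe cells are not preserved in this export. A minimal sketch on the four-column frame used earlier; the percentiles argument is an optional extra shown for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

summary = df.describe()                    # all numerical columns
one_col = df["one"].describe()             # a single column
custom = df.describe(percentiles=[0.8])    # request the 80th percentile row
```

The result is itself a DataFrame (or a Series for one column), so individual statistics can be read with loc, for example summary.loc["mean", "one"].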
k) Pipe Functions
1. Pipe Function
The pipe() method in a Pandas DataFrame passes the whole DataFrame to a function, together
with any extra arguments, and returns the function's result. Unlike apply(), which calls the
function on each column or row, pipe() calls it once on the entire DataFrame, which makes it easy to chain
multiple operations together by passing the output of one function to the input of the next
function.
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
Example 1
0 11 20 110 1010
1 12 30 210 2010
2 13 40 310 3010
3 14 50 410 4010
Example 2
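The pipe cells are missing from this export. As a sketch, the helper functions add_constant and scale below are hypothetical; the first call reproduces the "plus 10" table of Example 1, and the second shows how calls can be chained.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

def add_constant(frame, value):
    """Return a new frame with value added to every element."""
    return frame + value

def scale(frame, factor):
    """Return a new frame with every element multiplied by factor."""
    return frame * factor

plus_ten = df.pipe(add_constant, 10)                 # same as add_constant(df, 10)
chained = df.pipe(add_constant, 10).pipe(scale, 2)   # output of one feeds the next
```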
2. Apply Function
The apply() method in a Pandas DataFrame allows you to apply a function along an axis of the
DataFrame, that is, to each column (axis=0, the default) or to each row (axis=1). The function can be
either a built-in Python function or a user-defined function.
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
Out[301]: one 3
two 30
three 300
four 3000
dtype: int64
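The cell that produced Out[301] is not preserved. The values match the range (maximum minus minimum) of each column, so the sketch below assumes that was the function applied; the row-wise call is added for contrast.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

# axis=0 (default): the function receives one column at a time.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_sum = df.apply(lambda row: row.sum(), axis=1)
```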
The map() method of a Pandas Series allows you to apply a function to each element
of a specific column of the DataFrame. The function can be either a built-in Python function
or a user-defined function.
applymap and apply are both used for applying a function to a pandas DataFrame;
apply is also available on a Series.
So, applymap is meant for element-wise operations on a DataFrame, while apply works
row-wise or column-wise on a DataFrame and element-wise on a Series. Since pandas 2.1, applymap is deprecated in favor of DataFrame.map.
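A short sketch of the three levels, using the A/B values shown in the table that follows. The getattr fallback is an assumption made so that the same code runs on pandas versions before and after 2.1.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [7, 9, 2]})

# Series.map: one column, element by element.
doubled_a = df["A"].map(lambda v: v * 2)

# Element-wise over the whole frame: DataFrame.map on pandas >= 2.1,
# applymap on older versions.
elementwise = getattr(df, "map", None) or df.applymap
squared = elementwise(lambda v: v ** 2)

# DataFrame.apply: one column at a time.
col_max = df.apply(max)
```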
A B
0 1 7
1 3 9
2 5 2
A 3.4
B 6.4
dtype: float64
l) Reindex Function
The reindex function in Pandas is used to change the row labels and/or column labels of a
DataFrame. This function can be used to align data from multiple DataFrames or to update
the labels based on new data. The function takes in a list or an array of new labels as its first
argument and, optionally, a fill value to replace any missing values. The reindexing can be
done along either the row axis (0) or the column axis (1). The reindexed DataFrame is
returned.
Example 1 - Rows
Example 2 - Columns
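Both example cells are missing from this export. A minimal sketch on a small made-up frame; the labels 5 and 'three' are chosen only to show what happens with labels that did not exist before.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [10, 20, 30]})

# Rows: keep labels 0 and 2, add a new label 5 filled with 0.
rows = df.reindex([0, 2, 5], fill_value=0)

# Columns: reorder and add a new column, which is filled with NaN by default.
cols = df.reindex(columns=["two", "one", "three"])
```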
2 Jim Chicago 35
3 Joan Houston 40
The rename function in Pandas is used to change the row labels and/or column labels of a
DataFrame. It can be used to update the names of one or multiple rows or columns by
passing a dictionary of new names as its argument. The dictionary should have the old
names as keys and the new names as values. Labels that are not in the dictionary are left unchanged.
a 1 10 100 1000
b 2 20 200 2000
c 3 30 300 3000
3 4 40 400 4000
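The rename cell is not preserved. A sketch that would give the row labels shown above (a, b, c and an unchanged 3); the column rename to 'ONE' is a made-up addition to show the columns argument.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3, 4],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 300, 400],
    "four": [1000, 2000, 3000, 4000],
})

# Old labels are the keys, new labels the values; label 3 is not listed.
renamed = df.rename(index={0: "a", 1: "b", 2: "c"}, columns={"one": "ONE"})
```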
Pandas provides several methods to sort a DataFrame based on one or more columns.
sort_values: This method sorts the DataFrame based on one or more columns. The
default sorting order is ascending, but you can change it to descending by passing the
ascending argument with a value of False.
0 11 10 100 1000
1 51 20 200 2000
2 31 30 500 3000
3 41 40 400 4000
0 11 10 100 1000
2 31 30 500 3000
3 41 40 400 4000
1 51 20 200 2000
1 51 20 200 2000
3 41 40 400 4000
2 31 30 500 3000
0 11 10 100 1000
0 11 10 100 1000
2 31 30 500 3000
3 41 40 400 4000
1 51 20 200 2000
The kind parameter of sort_values selects the sorting algorithm:
quicksort (the default)
mergesort
heapsort
0 11 10 100 1000
2 31 30 500 3000
3 41 40 400 4000
1 51 20 200 2000
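The sorting cells are missing from this export. The sketch below rebuilds the frame from the first table and sorts on column 'one', which is an assumption that matches the row orders shown.

```python
import pandas as pd

df = pd.DataFrame({
    "one": [11, 51, 31, 41],
    "two": [10, 20, 30, 40],
    "three": [100, 200, 500, 400],
    "four": [1000, 2000, 3000, 4000],
})

ascending = df.sort_values("one")                     # rows 0, 2, 3, 1
descending = df.sort_values("one", ascending=False)   # rows 1, 3, 2, 0
stable = df.sort_values("one", kind="mergesort")      # explicit algorithm
restored = descending.sort_index()                    # back to 0, 1, 2, 3
```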
o) Groupby Functions
The groupby function in pandas is used to split a dataframe into groups based on one or
more columns. It returns a DataFrameGroupBy object, which is similar to a DataFrame but
has some additional methods to perform operations on the grouped data.
4 SA 3 2014 776
5 SA 4 2015 784
6 SA 1 2016 834
7 SA 1 2017 824
8 NZ 2 2016 758
9 NZ 4 2014 691
10 NZ 1 2015 883
In [365]: df.groupby('Team').groups
Out[365]: {'Australia': [2, 3], 'India': [0, 1, 11], 'NZ': [8, 9, 10], 'SA': [4,
5, 6, 7]}
In [366]: df.groupby(['Team','Year']).get_group(('Australia',2014))
In [374]: df.groupby('Team')['Points'].sum()  # select the column first so only Points is summed
Out[374]: Team
Australia 1706
India 2459
NZ 2332
SA 3218
Name: Points, dtype: int64
Sorting these totals in descending order displays the teams with the highest sum of Points first
Out[377]: Team
SA 3218
India 2459
NZ 2332
Australia 1706
Name: Points, dtype: int64
Team
4 SA 3 2014 776
5 SA 4 2015 784
6 SA 1 2016 834
7 SA 1 2017 824
South Africa appears in 4 rows of the data, which is why South Africa is being
displayed here
8 NZ 2 2016 758
9 NZ 4 2014 691
10 NZ 1 2015 883
India and New Zealand each appear in 3 rows of the data, which is why they are being
displayed here
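Several of the groupby cells are missing. The sketch below uses only the seven rows visible above (SA and NZ); the column name Rank is an assumption, and the filter condition is a guess at how the "appears 4 times" result was produced.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["SA", "SA", "SA", "SA", "NZ", "NZ", "NZ"],
    "Rank": [3, 4, 1, 1, 2, 4, 1],      # column name assumed
    "Year": [2014, 2015, 2016, 2017, 2016, 2014, 2015],
    "Points": [776, 784, 834, 824, 758, 691, 883],
})

# Total points per team, highest first.
totals = df.groupby("Team")["Points"].sum().sort_values(ascending=False)

# Keep only the teams that appear in at least 4 rows.
four_or_more = df.groupby("Team").filter(lambda g: len(g) >= 4)
```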
a) Reading csv
In [398]: df = pd.read_csv('Football.csv')
df.head()
Out[398]:
   Country  League   Club   Player Names       Matches_Played  Substitution  Mins  Goals  xG
0  Spain    La Liga  (BET)  Juanmi Callejon    19              16            1849  11     6.62
1  Spain    La Liga  (BAR)  Antoine Griezmann  36              0             3129  16     11.86
2  Spain    La Liga  (ATL)  Luis Suarez        34              1             2940  28     23.21
3  Spain    La Liga  (CAR)  Ruben Castro       32              3             2842  13     14.06
4  Spain    La Liga  (VAL)  Kevin Gameiro      21              10            1745  13     10.65
Out[391]:
   App                                               Category        Rating  Reviews  Size  Installs     Type  Price  Content Rating
0  Photo Editor & Candy Camera & Grid & ScrapBook    ART_AND_DESIGN  4.1     159      19M   10,000+      Free  0      Every
1  Coloring book moana                               ART_AND_DESIGN  3.9     967      14M   500,000+     Free  0      Every
2  U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN  4.7     87510    8.7M  5,000,000+   Free  0      Every
3  Sketch - Draw & Paint                             ART_AND_DESIGN  4.5     215644   25M   50,000,000+  Free  0      T
4  Pixel Draw - Number Art Coloring Book             ART_AND_DESIGN  4.3     967      2.8M  100,000+     Free  0      Every
In [399]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 660 non-null object
1 League 660 non-null object
2 Club 660 non-null object
3 Player Names 660 non-null object
4 Matches_Played 660 non-null int64
5 Substitution 660 non-null int64
6 Mins 660 non-null int64
7 Goals 660 non-null int64
8 xG 660 non-null float64
9 xG Per Avg Match 660 non-null float64
10 Shots 660 non-null int64
11 OnTarget 660 non-null int64
12 Shots Per Avg Match 660 non-null float64
13 On Target Per Avg Match 660 non-null float64
14 Year 660 non-null int64
dtypes: float64(4), int64(7), object(4)
memory usage: 77.5+ KB
In [400]: df.isnull()
Out[400]:
     Country  League  Club   Player Names  Matches_Played  Substitution  Mins   Goals  xG
...  ...      ...     ...    ...           ...             ...           ...    ...    ...
655  False    False   False  False         False           False         False  False  False
656  False    False   False  False         False           False         False  False  False
657  False    False   False  False         False           False         False  False  False
658  False    False   False  False         False           False         False  False  False
659  False    False   False  False         False           False         False  False  False
So we can see we are getting a boolean table of True and False values.
If we use the sum function along with it, we get how many null values are present
in each column.
In [401]: df.isnull().sum()
Out[401]: Country 0
League 0
Club 0
Player Names 0
Matches_Played 0
Substitution 0
Mins 0
Goals 0
xG 0
xG Per Avg Match 0
Shots 0
OnTarget 0
Shots Per Avg Match 0
On Target Per Avg Match 0
Year 0
dtype: int64
Let us first check the 80th percentile value of each column using the describe function
Out[404]:
   Matches_Played  Substitution  Mins  Goals  xG  xG Per Avg Match
Let us use the quantile function to get the exact value now
In [406]: df['Mins'].quantile(.80)
Out[406]: 2915.8
In [407]: df['Mins'].quantile(.99)
Out[407]: 3520.0199999999995
This function is important because it can be used to treat outliers in the Data Science EDA
process
e) Copy function
If we simply do:
de=df
then de is only another name for the same object, so a change in de will affect the data of df as well. We need to copy in such a way
that it creates a totally new object and does not affect the old dataframe
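The difference between plain assignment and copy() can be sketched on a small made-up frame (the column names here are illustrative, not from Football.csv):

```python
import pandas as pd

df = pd.DataFrame({"Goals": [11, 16, 28]})

alias = df                                   # same object under a second name
alias["Half"] = alias["Goals"] / 2           # also shows up in df

independent = df.copy()                      # a separate object
independent["Double"] = independent["Goals"] * 2   # df is not affected
```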
In [413]: de = df.copy()
de.head(3)
Out[413]:
   Country  League   Club   Player Names       Matches_Played  Substitution  Mins  Goals  xG
0  Spain    La Liga  (BET)  Juanmi Callejon    19              16            1849  11     6.62
1  Spain    La Liga  (BAR)  Antoine Griezmann  36              0             3129  16     11.86
2  Spain    La Liga  (ATL)  Luis Suarez        34              1             2940  28     23.21
Out[414]:
   Country  League   Club   Player Names       Matches_Played  Substitution  Mins  Goals  xG
0  Spain    La Liga  (BET)  Juanmi Callejon    19              16            1849  11     6.62
1  Spain    La Liga  (BAR)  Antoine Griezmann  36              0             3129  16     11.86
2  Spain    La Liga  (ATL)  Luis Suarez        34              1             2940  28     23.21
3  Spain    La Liga  (CAR)  Ruben Castro       32              3             2842  13     14.06
4  Spain    La Liga  (VAL)  Kevin Gameiro      21              10            1745  13     10.65
So we can see a new column has been added to the copy, but our old data is unchanged
In [415]: df.head()
Out[415]:
   Country  League   Club   Player Names       Matches_Played  Substitution  Mins  Goals  xG
0  Spain    La Liga  (BET)  Juanmi Callejon    19              16            1849  11     6.62
1  Spain    La Liga  (BAR)  Antoine Griezmann  36              0             3129  16     11.86
2  Spain    La Liga  (ATL)  Luis Suarez        34              1             2940  28     23.21
3  Spain    La Liga  (CAR)  Ruben Castro       32              3             2842  13     14.06
4  Spain    La Liga  (VAL)  Kevin Gameiro      21              10            1745  13     10.65
While analyzing the data, many times the user wants to see the unique values in a particular
column, which can be done using Pandas unique() function.
Out[419]: 444
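The cells for this section are missing; a count such as 444 would typically come from nunique() or from the length of unique(). A sketch on a few made-up club codes:

```python
import pandas as pd

clubs = pd.Series(["(BET)", "(BAR)", "(ATL)", "(BAR)"], name="Club")

distinct = clubs.unique()        # array of distinct values, in order of appearance
how_many = clubs.nunique()       # number of distinct values
counts = clubs.value_counts()    # how often each value occurs
```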
h) dropna() function
Sometimes a csv file has null values, which are later displayed as NaN in the Data Frame.
The Pandas dropna() method allows the user to drop rows or columns with null
values in different ways.
Syntax:
DataFrameName.dropna(axis=0,inplace=False)
axis: axis takes an int or string value for rows/columns. Input can be 0 or 1 for integer
and 'index' or 'columns' for string.
Out[422]:
   App                                               Category        Rating  Reviews  Size  Installs     Type  Price  Content Rating
0  Photo Editor & Candy Camera & Grid & ScrapBook    ART_AND_DESIGN  4.1     159      19M   10,000+      Free  0      Every
1  Coloring book moana                               ART_AND_DESIGN  3.9     967      14M   500,000+     Free  0      Every
2  U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN  4.7     87510    8.7M  5,000,000+   Free  0      Every
3  Sketch - Draw & Paint                             ART_AND_DESIGN  4.5     215644   25M   50,000,000+  Free  0      T
4  Pixel Draw - Number Art Coloring Book             ART_AND_DESIGN  4.3     967      2.8M  100,000+     Free  0      Every
In [423]: df.isnull().sum()
Out[423]: App 0
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64
So it seems we have a lot of null values in column Rating and a few null values in
some other columns.
Calling dropna() will delete all the rows which contain null values
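The dropna cell itself is missing. A minimal sketch on a three-row made-up frame with the same kind of gaps; the values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Rating": [4.1, np.nan, 4.7],
    "Type": ["Free", None, "Free"],
    "Price": ["0", "0", "0"],
})

rows_kept = df.dropna()                    # drop rows with any null value
cols_kept = df.dropna(axis=1)              # drop columns with any null value
rated = df.dropna(subset=["Rating"])       # only look at the Rating column
```

dropna returns a new DataFrame by default, so df keeps all three rows unless the result is assigned back or inplace=True is used.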
i) Fillna Function
The Pandas fillna() function is used to fill NA/NaN values with a specified value or method.
If we want to fill the null values with something instead of removing them, then we
can use the fillna function.
Here we will be filling the numerical columns with their mean values and the categorical columns
with their mode
10841
Numerical columns
10841
If we had used inplace=True, it would have permanently stored those values in
our dataframe
Categorical values
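The fillna cells are missing from this export. A sketch of the mean/mode approach described above, on a small made-up frame; assigning the result back is used here instead of inplace=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Rating": [4.0, np.nan, 5.0],       # numerical column
    "Type": ["Free", None, "Free"],     # categorical column
})

# Numerical column: fill with the mean of the known values.
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Categorical column: fill with the most frequent value.
# mode() returns a Series because there can be ties, so take the first entry.
df["Type"] = df["Type"].fillna(df["Type"].mode()[0])
```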
j) sample function
Pandas sample() is used to return a random sample of rows (or of columns, with axis=1) from the
calling data frame.
Syntax:
In [471]: df.sample(5)
Out[471]:
      App                                 Category            Rating  Reviews  Size                Installs    Type  Pric
9083  Displaying You VR                   PHOTOGRAPHY         4.19    1        67M                 50+         Free
1547  Eternal life                        LIBRARIES_AND_DEMO  5.00    26       2.5M                1,000+      Free
433   Safest Call Blocker                 COMMUNICATION       4.40    27540    3.7M                1,000,000+  Free
7452  San Andreas Crime City Gangster 3D  FAMILY              4.20    9403     Varies with device  1,000,000+  Free
190   ADP Mobile Solutions                BUSINESS            4.30    85185    29M                 5,000,000+  Free
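Because the rows are drawn at random, each run gives a different result unless a seed is fixed. A sketch on a made-up frame; random_state and frac are standard parameters of sample.

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

five = df.sample(5, random_state=0)      # 5 random rows, reproducible
again = df.sample(5, random_state=0)     # same seed, same rows
tenth = df.sample(frac=0.1)              # 10 percent of the rows
```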
k) to_csv() function
The Pandas to_csv() function writes the given Series or DataFrame to a comma-separated values
(csv) file/format.
When the file is read back we get an extra "Unnamed: 0" column; if we want to avoid that, we need to add an extra
parameter, index=False, when writing
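The to_csv cells are not preserved. The sketch below writes to in-memory text instead of a file so it can run anywhere; with a real file, a path such as 'report.csv' (a hypothetical name) would be passed to to_csv and read_csv.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Without a path, to_csv returns the csv text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

reread_with = pd.read_csv(io.StringIO(with_index))        # has 'Unnamed: 0'
reread_without = pd.read_csv(io.StringIO(without_index))  # only 'a' and 'b'
```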
In [484]: df = pd.read_csv('Football.csv')
df.head()
Out[484]:
   Country  League   Club   Player Names       Matches_Played  Substitution  Mins  Goals  xG
0  Spain    La Liga  (BET)  Juanmi Callejon    19              16            1849  11     6.62
1  Spain    La Liga  (BAR)  Antoine Griezmann  36              0             3129  16     11.86
2  Spain    La Liga  (ATL)  Luis Suarez        34              1             2940  28     23.21
3  Spain    La Liga  (CAR)  Ruben Castro       32              3             2842  13     14.06
4  Spain    La Liga  (VAL)  Kevin Gameiro      21              10            1745  13     10.65
In [486]: report
Out[486]: