Pandas Module (Part-I)

The document discusses the Pandas module in Python. Pandas allows working with and analyzing datasets and is used for tasks like cleaning, exploring and manipulating data. It introduces the concepts of Series and DataFrames which are the primary data structures for working with data in Pandas.

Uploaded by

Aditya Chauhan


PANDAS MODULE

05/19/24
• The name Pandas is derived from "panel data".
• It is a Python library for working with and analyzing datasets.
• It is used for cleaning, analyzing, exploring and manipulating data.
• Pandas lets us analyze large datasets and draw conclusions based on statistical theory.
• Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
• The command to install pandas is:
pip install pandas

Series and DataFrame
• Data can be held in a Series or in a DataFrame.
• A Series holds data as a 1-D array, while a DataFrame holds data as a 2-D array.
• A DataFrame is just a table in which each column is a Series:
Series + Series + ... = DataFrame
• Just as a 2D array is a collection of several 1D arrays, a DataFrame is a collection of several Series.
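
The idea Series + Series + ... = DataFrame can be sketched as below. This is only an illustration: the column names and values are made up, and pd.concat is one of several ways to place Series side by side.

```python
import pandas as pd

# Two Series with the same default index (illustrative values)
calories = pd.Series([420, 380, 390], name="calories")
duration = pd.Series([50, 40, 45], name="duration")

# Placing the Series side by side (axis=1) gives a DataFrame
df = pd.concat([calories, duration], axis=1)
print(df)

# Selecting a single column of the DataFrame gives back a Series
col = df["calories"]
print(type(col))
```

Each Series name becomes a column name of the DataFrame.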
Creating Series (like 1D array)
import pandas as pd
a = [3, 2, 0, 1]
S = pd.Series(a)
print(S)

Output:
0    3
1    2
2    0
3    1
dtype: int64

• The first column shows the numeric index of each element of the Series. With it we can access any element of the Series; for example, print(S[2]) prints 0. These indexes are treated as labels.
• We can replace the numeric index labels with our own named keys:
s = pd.Series(a, index=['a', 'b', 'c', 'd'])
print(s)

Output:
a    3
b    2
c    0
d    1
dtype: int64
• Now we can access any element of the Series by these keys; for example, print(s['a']) prints 3.

• A Series can also be created from a dictionary:
import pandas as pd
d = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(d)
print(s)

Output:
day1    420
day2    380
day3    390
dtype: int64

Here the dictionary keys become the index labels of the Series.

• We can also keep only some items of the dictionary in the Series by listing just the keys we want in the index keyword argument:
s = pd.Series(d, index=['day1', 'day2'])
Here day3 of the dictionary is skipped.
• We can pass a list, a tuple or a dictionary to pd.Series().
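
The three kinds of input can be compared in a short sketch. The variable names are chosen only for this illustration; the data is the same as in the slides above.

```python
import pandas as pd

# A list and a tuple with the same values give equal Series
from_list = pd.Series([3, 2, 0, 1])
from_tuple = pd.Series((3, 2, 0, 1))

# A dictionary supplies both the index labels (keys) and the values
d = {"day1": 420, "day2": 380, "day3": 390}
from_dict = pd.Series(d)

# The index argument keeps only the listed keys
subset = pd.Series(d, index=["day1", "day2"])

print(from_list.equals(from_tuple))
print(subset)
```

If the index lists a key that is not in the dictionary, that entry is filled with NaN.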

Creating DataFrame (like 2D Array or Table)

import pandas as pd
d = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(d)
print(df)

Output:
   calories  duration
0       420        50
1       380        40
2       390        45
• Here the dictionary keys become the names of the columns, not of the rows; in a Series they were the row labels.
• The loc attribute returns one or more specified rows. A single label returns that row as a Series:

print(df.loc[0])
Output:
calories    420
duration     50
Name: 0, dtype: int64

• A list of labels returns those rows as a DataFrame:
print(df.loc[[0, 1]])
Output:
   calories  duration
0       420        50
1       380        40

• We can also slice with df.loc[i:j]; unlike ordinary Python slicing, the upper bound j is included.
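
The inclusive upper bound of loc can be checked with a small sketch. The comparison with iloc (position-based selection) is an addition for contrast and is not covered in the slides above.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# loc slices by label: rows labelled 0 and 1 (upper bound included)
by_label = df.loc[0:1]

# iloc slices by position: row 0 only (upper bound excluded, like list slicing)
by_position = df.iloc[0:1]

print(len(by_label))
print(len(by_position))
```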

Index keyword Argument
• The index keyword argument lets us give our own names to the rows:
import pandas as pd

d = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(d, index=["day1", "day2", "day3"])

print(df)
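
Once the rows are named, loc selects them by name. A minimal sketch using the same data (the variable names row and rows are chosen only for this illustration):

```python
import pandas as pd

d = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(d, index=["day1", "day2", "day3"])

# A single row label returns that row as a Series
row = df.loc["day2"]
print(row["calories"])

# A label slice includes both end points
rows = df.loc["day1":"day2"]
print(rows)
```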

Reading CSV file for creating DataFrame

import pandas as pd
df = pd.read_csv('data.csv')
print(df)              # for a large DataFrame, prints only the first 5 and last 5 rows
print(df.to_string())  # prints all rows

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())  # first 5 rows
print(df.tail())  # last 5 rows
df.info()         # prints a summary of the data (info() itself returns None)

Output of df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB

• In our data set, only 164 of the 169 values in the "Calories" column are non-null, which means that 5 rows have no value at all in the "Calories" column, for whatever reason.
• Empty values, or null values, can distort an analysis, and we should consider removing rows that contain them. This is a step towards what is called cleaning data.
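
The missing values can also be counted directly. The sketch below does not use data.csv; it assumes a tiny made-up CSV held in a string, with the same column names, so that it runs on its own.

```python
import io
import pandas as pd

# Made-up sample with one empty Calories cell (second data row)
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,
45,109,175,282.4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of empty cells in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# The rows that contain at least one empty cell
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```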
Cleaning Data
• Data cleaning means fixing bad data in our data set. Bad data could be:
  • Empty cells (see rows 18, 22 and 28)
  • Data in the wrong format (see row 26)
  • Wrong data (see row 7)
  • Duplicates (see rows 11 and 12)

I. Cleaning Empty Cells
• Empty cells can give us wrong results when we analyze data.
• (1) One way to deal with empty cells is to remove the rows that contain them. This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()  # returns a new DataFrame with no empty cells
print(new_df.to_string())
• new_df = df.dropna() returns a new DataFrame and leaves the original DataFrame unchanged. To change the original DataFrame, use inplace=True:
df.dropna(inplace=True)
• If we write new_df = df.dropna(inplace=True), the change is still made only in the original DataFrame, and new_df will contain None.
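
The difference between the two forms of dropna() can be seen in a small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, float("nan"), 195.0],  # one empty cell
})

# Without inplace: the original keeps 3 rows, the copy has 2
new_df = df.dropna()
print(len(df), len(new_df))

# With inplace=True: the original is changed and the call returns None
result = df.dropna(inplace=True)
print(result)
print(len(df))
```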
• (2) The second method is to fill the empty cells with a new value instead of dropping the entire row, so that we do not have to delete rows just because of a few empty cells. The fillna() method replaces empty cells with a value:

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace=True)
• The example above replaces the empty cells in all columns of the DataFrame. To replace empty values in only one column, specify the column name:

import pandas as pd
df = pd.read_csv('data.csv')
# Assign the result back: inplace=True on a selected column may act on a copy in recent pandas versions
df["Calories"] = df["Calories"].fillna(130)

• A common way to replace empty cells is to use the mean, median or mode of the column. Pandas provides the mean(), median() and mode() methods to calculate these values for a specified column:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)  # assign back instead of inplace=True on the column

• Mean = the average value (the sum of all values divided by the number of values).
• Median = the value in the middle, after all values have been sorted in ascending order.
• Mode = the value that appears most frequently.

Note: mode() returns a Series, because several values can be tied for most frequent, so we take its first element:
x = df['Calories'].mode()[0]
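
The three statistics can be compared on a small made-up column. The values below are chosen only so that the results are easy to check by hand: the non-empty values are 300, 300, 400 and 500.

```python
import pandas as pd

calories = pd.Series([300.0, 300.0, 400.0, float("nan"), 500.0])

# Empty cells are skipped when the statistics are calculated
mean_value = calories.mean()      # (300 + 300 + 400 + 500) / 4 = 375.0
median_value = calories.median()  # middle of 300, 300, 400, 500 = 350.0
mode_value = calories.mode()[0]   # most frequent value = 300.0

# Fill the empty cell with one of them, here the median
filled = calories.fillna(median_value)
print(filled)
```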

II. Cleaning wrong format
• Cells with data in the wrong format can make it difficult, or even impossible, to analyze data.
• To fix this, we have two options: remove the rows, or convert all cells in the column into the same format.
• Removing the rows that have an empty value in a particular column (here 'Date'):
df.dropna(subset=['Date'], inplace=True)

• Converting all cells in the column into the same format:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df)
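
The two options can be combined: convert first, then remove the rows that could not be converted. The sketch below uses made-up values rather than data.csv, and the errors="coerce" argument (which turns unparseable values into NaT, the empty date value) is an addition not used in the slides.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", "not a date", None],
    "Duration": [60, 60, 45, 45],
})

# Values that cannot be parsed become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Remove the rows whose Date is empty (NaT)
df = df.dropna(subset=["Date"])
print(df)
```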

III. Cleaning Wrong Data
• "Wrong data" does not have to be "empty cells" or "wrong format"; it can simply be wrong, for example if someone registered "199" instead of "1.99".
• Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.
• If you take a look at our data set, you can see that in row 7 the duration is 450, but for all the other rows the duration is between 30 and 60.
• It doesn't have to be wrong, but taking into consideration that this is the data set of someone's workout sessions, we conclude that this person did not work out for 450 minutes.

• One way to fix wrong values is to replace them with something else. In our example, it is most likely a typo, and the value should be 45 instead of 450, so we could just set row 7 to 45:
df.loc[7, 'Duration'] = 45
In general: df.loc[row, column] = value
• For small data sets you might be able to replace the wrong data one by one, but not for big data sets.
• For bigger data sets we can apply a rule instead, for example setting every duration above 120 to 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
• Another way of handling wrong data is to remove the rows that contain it. This way we do not have to find out what to replace the values with, and there is a good chance we do not need those rows for our analysis.
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
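
The same two rules can be written without a loop by using a boolean condition on the column. This is a sketch on made-up values, shown as an alternative to the loops above rather than as the method of these slides.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 450, 30]})

# Rule 1: set every duration above 120 to 120
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120
print(capped["Duration"].tolist())

# Rule 2: keep only the rows whose duration is at most 120
dropped = df[df["Duration"] <= 120]
print(dropped["Duration"].tolist())
```

One difference to keep in mind: a row with an empty Duration is kept by the loop version (the comparison with 120 is False) but removed by the <= 120 filter.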
IV. Removing Duplicate rows
• In our data set, rows 11 and 12 are duplicate rows. The drop_duplicates() method removes duplicates, keeping the first occurrence of each row:
df.drop_duplicates(inplace=True)
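
Before removing duplicates, we can check which rows are duplicates with the duplicated() method. A minimal sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Pulse": [100, 100, 109],
})

# True for every row that repeats an earlier row
flags = df.duplicated()
print(flags)

# Keep the first occurrence of each row, drop the repeats
deduped = df.drop_duplicates()
print(deduped)
```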

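Finding Relationships between Columns

• A great aspect of the Pandas module is the corr() method.
• The corr() method calculates the relationship (correlation) between each pair of columns in our data set.
df.corr()

Note: In older pandas versions the corr() method ignores non-numeric columns. From pandas 2.0 onward, pass numeric_only=True to get the same behavior; otherwise non-numeric columns raise an error.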
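
A small sketch of corr() on made-up values, assuming pandas 1.5 or later (where the numeric_only argument of corr() is available). The Calories values are chosen to be exactly proportional to Duration, and the Note column is a hypothetical text column added to show how non-numeric data is handled.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200.0, 300.0, 400.0, 600.0],  # proportional to Duration
    "Note": ["a", "b", "c", "d"],              # non-numeric column
})

# numeric_only=True leaves the text column out of the calculation
corr = df.corr(numeric_only=True)
print(corr)
```

Because Calories rises in exact proportion to Duration here, their correlation comes out as 1 (a perfect correlation).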

Result Explained
• The result of the corr() method is a table of numbers that represent how strong the relationship between each pair of columns is. Each number ranges from -1 to 1.
• 1 means that there is a 1 to 1 relationship (a perfect correlation): for this data set, each time a value went up in the first column, the other one went up as well.
• 0.9 is also a good relationship: if you increase one value, the other will probably increase as well.
• -0.9 is just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
• 0.2 means NOT a good relationship: if one value goes up, it does not mean that the other will.
• Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.
• Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good correlation. We can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long workout.
• Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation, meaning that we cannot predict the max pulse just by looking at the duration of the workout, and vice versa.