
PANDAS MODULE (PART-I)

 Pandas stands for Panel Data.
 It is a Python library for analyzing data and for working with data sets.
 It is used for cleaning, exploring, analyzing and manipulating data.
 Pandas allows us to analyze big data and draw conclusions based on statistical theories.
 Pandas can also delete rows that are not relevant or that contain wrong values, like empty or NULL values. This is called cleaning the data.
 The command to install pandas is:
pip install pandas
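
To verify the installation, one quick check (a minimal sketch):

import pandas as pd
print(pd.__version__)   # prints the installed pandas version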

Series and DataFrame
 Data can be in Series or in DataFrame form.
 A Series holds data in a 1-D array, while a DataFrame holds data in a 2-D array.
 A DataFrame is just a table where each column is a Series:
Series + Series + …… = DataFrame
 We know that a 2D array is a collection of several 1D arrays. Similarly, a DataFrame is a collection of several Series (see the sketch below).
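
A minimal sketch of this idea, combining two Series into one DataFrame (the column names are just illustrative):

import pandas as pd

calories = pd.Series([420, 380, 390])   # one 1-D column of data
duration = pd.Series([50, 40, 45])      # another 1-D column of data

# each Series becomes one column of the 2-D DataFrame
df = pd.DataFrame({"calories": calories, "duration": duration})
print(df)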
Example
[Figure: an example Series and DataFrame, as described above]
Creating Series (like 1D array)
import pandas as pd
a = [3, 2, 0, 1]
S = pd.Series(a)
print(S)

Output:
0    3
1    2
2    0
3    1
dtype: int64

 The first column shows the numeric indexes of the elements of the Series. With these we can access any element of the Series, e.g. print(S[2]) will print 0. These indexes are treated as labels.
 We can replace these numeric index labels with our own named key indexes:
s = pd.Series(a, index=['a', 'b', 'c', 'd'])
a    3
b    2
c    0
d    1
dtype: int64
Now we can access any element of the Series by these key indexes, e.g. print(s['a']) will print 3.

 import pandas as pd
d = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(d)
print(s)

Output:-
day1    420
day2    380
day3    390
dtype: int64

Here the dictionary keys are treated as the key indexes.

 We can also skip some elements of the dictionary when building the Series, by mentioning only the wanted keys in the index keyword argument:
s = pd.Series(d, index=['day1', 'day2'])
Here we have skipped day3 of the dictionary.

 In the Series() method, we can pass either a List, a Tuple or a Dictionary.
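
For instance, a tuple works the same way as a list (a minimal sketch):

import pandas as pd

t = (10, 20, 30)   # a tuple instead of a list
s = pd.Series(t)
print(s)           # indexes 0, 1, 2 with values 10, 20, 30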

Creating DataFrame (like 2D Array or Table)

import pandas as pd
d = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(d)
print(df)
Output:-
   calories  duration
0       420        50
1       380        40
2       390        45
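
Tying back to the idea that a DataFrame is a collection of Series: selecting a single column of the df above returns a Series (a quick sketch):

col = df["calories"]   # select one column of the DataFrame above
print(type(col))       # <class 'pandas.core.series.Series'>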
 Now in this case, the keys become the names of the columns, not the names of the rows, while in the case of a Series they were the names of the rows.
 The loc attribute returns one or more specified row(s):

print(df.loc[0])
Output:-
calories    420
duration     50
Name: 0, dtype: int64

print(df.loc[[0, 1]])
   calories  duration
0       420        50
1       380        40

We can also slice with df.loc[i:j], but note that here the upper bound is included (see the sketch below).
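
A minimal sketch of loc slicing with its inclusive upper bound:

import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390],
                   "duration": [50, 40, 45]})

# unlike plain Python slicing, df.loc[i:j] includes row j
print(df.loc[0:1])   # prints rows 0 AND 1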

Index keyword Argument
 import pandas as pd

d = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(d, index=["day1", "day2", "day3"])

print(df)
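
The named indexes then appear as the row labels:

Output:-
      calories  duration
day1       420        50
day2       380        40
day3       390        45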

Reading CSV file for creating DataFrame

 import pandas as pd
df = pd.read_csv('data.csv')
print(df)              # for a large DataFrame, prints only the first and last 5 rows
print(df.to_string())  # prints all rows

 import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())   # first 5 rows
print(df.tail())   # last 5 rows
print(df.info())   # gives info about the data

Output
 <class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None

 In our data set there are 164 non-null values out of 169 in the "Calories" column. This means that there are 5 rows with no value at all in the "Calories" column, for whatever reason.

 Empty values, or NULL values, can be bad when analyzing data, and we should consider removing rows with empty values. This is a step towards what is called cleaning the data (see the sketch below).
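
A minimal sketch of counting the empty cells directly, assuming the same data.csv:

import pandas as pd

df = pd.read_csv('data.csv')
print(df["Calories"].isnull().sum())   # number of empty cells in "Calories", here 5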
Cleaning Data
 Data cleaning means fixing bad data in our data set. Bad data could be:
 Empty cells (see rows 18, 22 and 28)
 Data in wrong format (see row 26)
 Wrong data (see row 7)
 Duplicates (see rows 11 and 12)
I. Cleaning Empty Cells
 Empty cells can give us a wrong result when we analyze data.
 (1) One way to deal with empty cells is to remove rows that contain empty cells. This is usually OK, since data sets can be very big, and removing a few rows will not have a big impact on the result.
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()   # returns a new DataFrame with no empty cells
print(new_df)
 new_df = df.dropna() returns a new DataFrame, keeping the original DataFrame unaffected. If we want to make changes in the original DataFrame, then we have to use inplace=True:
df.dropna(inplace=True)

If we write new_df = df.dropna(inplace=True), then the changes will still be made in the original DataFrame only; new_df will contain None.
 (2) The second method is, instead of dropping the entire row, to fill that particular cell with a new value. This way we do not have to delete entire rows just because of some empty cells. The fillna() method allows us to replace empty cells with a value:

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace=True)
 The example above replaces all empty cells in all the columns of the DataFrame.
To only replace empty values in one column, specify the column name for the DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
df["Calories"].fillna(130, inplace=True)

 A common way to replace empty cells is to calculate the mean, median or mode value of the column.
Pandas provides the mean(), median() and mode() methods to calculate the respective values for a specified column:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"].fillna(x, inplace=True)

 Mean = the average value (the sum of all values divided by the number of values).
 Median = the value in the middle, after you have sorted all values in ascending order.
 Mode = the value that appears most frequently.

Note: In the case of mode, we use the statement
x = df['Calories'].mode()[0]
since mode() returns a Series rather than a single value (see the sketch below).
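
A minimal sketch of filling empty cells with the mode, assuming the same data.csv:

import pandas as pd

df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]         # mode() returns a Series; [0] takes its first value
df["Calories"].fillna(x, inplace=True)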

II. Cleaning Wrong Format
 Cells with data in the wrong format can make it difficult, or even impossible, to analyze the data.
 To fix this, we have two options: remove the rows, or convert all cells in the column into the same format.
 Removing the rows that have an empty value in a particular column:
df.dropna(subset=['Date'], inplace=True)
 Converting all cells in the column into the same format:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df)
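
If some cells cannot be parsed as dates at all, to_datetime() would raise an error. A common pattern (a sketch, not from the original slides) is to coerce such values to NaT and then drop those rows:

import pandas as pd

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')  # unparsable cells become NaT
df.dropna(subset=['Date'], inplace=True)                  # then remove those rows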

III. Cleaning Wrong Data
 "Wrong data" does not have to be "empty cells" or "wrong format"; it can just be wrong, like if someone registered "199" instead of "1.99".
 Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.
 If you take a look at our data set, you can see that in row 7 the duration is 450, but for all the other rows the duration is between 30 and 60.
 It doesn't have to be wrong, but considering that this is the data set of someone's workout sessions, we conclude that this person did not work out for 450 minutes.
 One way to fix wrong values is to replace them with something else. In our example it is most likely a typo, and the value should be "45" instead of "450", so we can just insert 45 in row 7:
df.loc[7, 'Duration'] = 45
In general: df.loc[row, column] = value

For small data sets you might be able to replace the wrong data one by one, but not for big data sets.
 For bigger data sets, loop through the index and cap every value above 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
 Another way of handling wrong data is to remove the rows that contain wrong data. This way we do not have to find out what to replace them with, and there is a good chance you do not need them for your analyses.
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
IV. Removing Duplicate Rows
 df.drop_duplicates(inplace=True)
In our data set, rows 11 and 12 are duplicate rows (see the sketch below for finding them first).
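
Before removing duplicates, the duplicated() method shows which rows are copies of an earlier row (a minimal sketch):

import pandas as pd

df = pd.read_csv('data.csv')
print(df.duplicated())            # True for each row that duplicates an earlier row
df.drop_duplicates(inplace=True)  # then remove them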

Finding Relationships between Columns

 A great aspect of the Pandas module is the corr() method.
 The corr() method calculates the relationship between each pair of columns in our data set:
df.corr()

Note: The corr() method ignores "not numeric" columns.
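
In recent pandas versions (2.0 and later), non-numeric columns are no longer dropped silently; they must be excluded explicitly, for example via the numeric_only parameter (a sketch):

import pandas as pd

df = pd.read_csv('data.csv')
print(df.corr(numeric_only=True))   # correlations between the numeric columns only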

Result Explained
 The result of the corr() method is a table of numbers that represent how strong the relationship between two columns is. The numbers vary from -1 to 1.
 1 means that there is a 1-to-1 relationship (a perfect correlation); for this data set, each time a value went up in the first column, the other one went up as well.
 0.9 is also a good relationship: if you increase one value, the other will probably increase as well.
 -0.9 is just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
 0.2 means NOT a good relationship, meaning that if one value goes up, it does not follow that the other will.
 Perfect Correlation:
"Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.
 Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good correlation, and we can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long workout.
 Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation, meaning that we cannot predict the max pulse by just looking at the duration of the workout, and vice versa.
