Pandas Module (Part-I)
05/19/24
The name Pandas is derived from "panel data".
It is a Python library for working with and analyzing
datasets.
It is used for cleaning, analyzing, exploring and
manipulating data.
Pandas allows us to analyze large datasets and draw conclusions
based on statistical theory.
Pandas can also delete rows that are not relevant or that
contain wrong values, such as empty or NULL values. This is
called cleaning the data.
The command to install pandas is:
pip install pandas
Series and DataFrame
Data can be held in a Series or in a DataFrame.
A Series holds data in a 1-D array, while a DataFrame
holds data in a 2-D table.
Creating Series (like 1D array)
import pandas as pd
a=[3,2,0,1]
S=pd.Series(a)
print(S)
Output:
0 3
1 2
2 0
3 1
dtype: int64
The first column shows the numeric indexes of the elements of the
Series. With these we can access any element of the Series, for example
print(S[2]) will print 0. These indexes are treated as labels.
We can replace the numeric index labels with our own named
indexes:
s=pd.Series(a,index=['a','b','c','d'])
print(s)
Output:
a 3
b 2
c 0
d 1
dtype: int64
Now we can access any element of the Series by these named indexes, for example
print(s['a']) will print 3.
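Once a Series has named indexes, it helps to be explicit about whether an element is picked by label or by position. A minimal sketch, assuming the same values and labels as above: loc looks up by label and iloc by 0-based position. Recent pandas versions warn when a plain integer such as s[2] is used as a position on a Series with named indexes, so iloc is the safer choice there.

```python
import pandas as pd

values = [3, 2, 0, 1]
s = pd.Series(values, index=["a", "b", "c", "d"])

by_label = s.loc["b"]      # look up by label
by_position = s.iloc[3]    # look up by 0-based position
print(by_label, by_position)
```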
import pandas as pd
d = {"day1": 420, "day2": 380, "day3": 390}
s = pd.Series(d)
print(s)
Output:-
day1 420
day2 380
day3 390
dtype: int64
We can also use only some elements of the dictionary
in the Series by listing just the keys we want in the index
keyword argument.
s=pd.Series(d, index=['day1', 'day2'])
Here day3 of the dictionary is skipped.
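The index argument works as a selection of dictionary keys. The sketch below uses the same dictionary and also shows what happens with a key that is not in the dictionary (the name "day4" is a made-up example): pandas keeps the label and stores a missing value (NaN) for it.

```python
import pandas as pd

d = {"day1": 420, "day2": 380, "day3": 390}

# Only the listed keys are kept; day3 is skipped.
selected = pd.Series(d, index=["day1", "day2"])

# "day4" is not a key of d, so its value becomes NaN.
with_missing = pd.Series(d, index=["day1", "day4"])

print(selected)
print(with_missing)
```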
Creating DataFrame (like 2D Array or Table)
import pandas as pd
d={
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(d)
print(df)
Output:-
calories duration
0 420 50
1 380 40
2 390 45
Here the dictionary keys become the names of the columns,
not the names of the rows. In the case of a Series,
they were the names of the rows.
The loc attribute returns one or more specified row(s).
print(df.loc[0])
Output:-
calories 420
duration 50
Name: 0, dtype: int64
print(df.loc[[0, 1]])
calories duration
0 420 50
1 380 40
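Index keyword argument
With the index keyword argument we can name the rows ourselves.
import pandas as pd
d={
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#row labels chosen for illustration
df = pd.DataFrame(d, index=["day1", "day2", "day3"])
print(df)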
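When the rows have named indexes, loc can return a row by its name. A minimal sketch, assuming the row labels "day1" to "day3" (these labels are an assumption made for illustration):

```python
import pandas as pd

d = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
# The row labels are hypothetical names chosen for this example.
df = pd.DataFrame(d, index=["day1", "day2", "day3"])

row = df.loc["day2"]   # a single row comes back as a Series
print(row)
```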
Reading CSV file for creating DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
print(df) #for a large DataFrame, prints only the first and last 5 rows
print(df.to_string()) #will print all rows
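The difference between the shortened printout and to_string() can be tried without a CSV file. The sketch below builds a 100-row DataFrame (the column name "value" is made up) and compares the two text forms. It assumes the default display settings of pandas.

```python
import pandas as pd

# 100 rows is enough for the default printout to be shortened.
df = pd.DataFrame({"value": range(100)})

short_text = repr(df)        # what print(df) shows
full_text = df.to_string()   # every row, no shortening

print(len(short_text.splitlines()), "lines in the short form")
print(len(full_text.splitlines()), "lines in the full form")
```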
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
print(df.tail())
print(df.info()) #info() prints a summary of the data and returns None, so print() also shows None
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
In our data set it seems like there are 164 of 169
Non-Null values in the "Calories" column.
Which means that there are 5 rows with no value at
all, in the "Calories" column, for whatever reason.
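The number of empty cells can also be counted directly instead of reading it off the info() output. A minimal sketch: the small CSV text below is made-up sample data with the same column names, not the real data.csv.

```python
import io
import pandas as pd

# Made-up sample rows; the second row has no Calories value.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,
45,109,175,282.4
"""
df = pd.read_csv(io.StringIO(csv_text))

# isna() marks empty cells, sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```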
I. Cleaning Empty Cells
Empty cells can give us wrong results when we
analyze data.
(1) One way to deal with empty cells is to remove
the rows that contain them.
(2) Another way is to replace the empty cells with a value:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
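Removing the rows that contain empty cells can be sketched with dropna(). The CSV text is made-up sample data with the same column names as the data set, used here so the example runs without a file.

```python
import io
import pandas as pd

# Made-up sample rows; the second row has no Calories value.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,
45,109,175,282.4
"""
df = pd.read_csv(io.StringIO(csv_text))

# dropna() returns a new DataFrame; df itself is not changed.
cleaned = df.dropna()
print(len(df), "rows before,", len(cleaned), "rows after")
```

By default dropna() leaves the original DataFrame as it is. Passing inplace = True changes the original instead.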
The example above replaces all empty cells in all
the columns of the DataFrame.
To replace empty values in only one column,
specify the column name for the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
#assign the result back; inplace = True on a selected column is unreliable in recent pandas versions
df["Calories"] = df["Calories"].fillna(130)
A common way to replace empty cells is to calculate
the mean, median or mode value of the column.
Pandas provides the mean(), median() and mode() methods to
calculate the respective values for a specified column:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
#assign the result back; inplace = True on a selected column is unreliable in recent pandas versions
df["Calories"] = df["Calories"].fillna(x)
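The same pattern works with the median or the mode. A minimal sketch with made-up Calories values: mode() returns a Series because several values can be equally frequent, so the first one is taken here.

```python
import pandas as pd

# Made-up values; None marks the empty cell.
df = pd.DataFrame({"Calories": [300.0, 400.0, None, 400.0, 250.0]})

median_value = df["Calories"].median()
mode_value = df["Calories"].mode().iloc[0]   # first of the most frequent values

filled_with_median = df["Calories"].fillna(median_value)
filled_with_mode = df["Calories"].fillna(mode_value)
print(filled_with_median.tolist())
print(filled_with_mode.tolist())
```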
Mean = the average value (the sum of all values
divided by number of values).
Median = the value in the middle, after you have
sorted all values ascending.
Mode = the value that appears most frequently.
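The three definitions can be checked on a small made-up list of durations. One large value (110) pulls the mean up, while the median and the mode stay at 45.

```python
import pandas as pd

durations = pd.Series([40, 45, 45, 60, 110])   # made-up values

mean_value = durations.mean()            # (40 + 45 + 45 + 60 + 110) / 5
median_value = durations.median()        # middle value after sorting
mode_value = durations.mode().iloc[0]    # most frequent value
print(mean_value, median_value, mode_value)
```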
II. Cleaning wrong format
Cells with data in the wrong format can make it
difficult, or even impossible, to analyze data.
To fix it, we have two options: remove the rows, or
convert all cells in the column into the same
format.
Removing the rows that have an empty value in a particular column
df.dropna(subset=['Date'], inplace = True)
Converting all cells in a column into the same
format
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df)
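The two options can be combined: convert what can be converted and remove the rest. The sketch below is one possible way, using made-up rows and assuming the dates are written as year/month/day. With errors="coerce", a cell that cannot be read as a date becomes an empty date value (NaT), which dropna() can then remove.

```python
import pandas as pd

# Made-up rows; the second Date cell is not a valid date.
df = pd.DataFrame({
    "Date": ["2020/12/01", "not a date", "2020/12/03"],
    "Duration": [60, 45, 30],
})

# Cells that do not match the format become NaT instead of raising an error.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# Remove the rows whose Date could not be converted.
df = df.dropna(subset=["Date"])
print(df)
```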
III. Cleaning Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong
format". It can simply be wrong, for example if someone registered "199"
instead of "1.99".
Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.
If you take a look at our data set, you can see that in row 7 the
duration is 450, but for all the other rows the duration is
between 30 and 60.
It does not have to be wrong, but considering that this
is the data set of someone's workout sessions, we can reasonably
conclude that this person did not work out for 450 minutes.
One way to fix wrong values is to replace them
with something else. In our example, it is most
likely a typo, and the value should be "45" instead
of "450", so we can just insert "45" in row 7:
df.loc[7, 'Duration'] = 45
In general: df.loc[row, column] = value
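Fixing one cell at a time does not scale to large data sets. A possible alternative is to set a limit and replace every value above it in one step. This is a sketch with made-up durations, and the limit of 120 is an assumption chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 450, 30]})   # made-up values

limit = 120   # hypothetical upper bound for a plausible workout

# The condition selects the rows, "Duration" selects the column.
df.loc[df["Duration"] > limit, "Duration"] = limit
print(df)
```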
Finding Relationships between Columns
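The corr() method calculates the relationship between each pair of columns in a DataFrame. A minimal sketch with made-up values (not the real data set): Calories is built to rise in step with Duration, so that pair gets a correlation of 1.

```python
import pandas as pd

# Made-up values; Calories grows in exact proportion to Duration.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 300, 400, 600],
    "Maxpulse": [130, 120, 140, 125],
})

corr_table = df.corr()   # one row and one column per numeric column
print(corr_table)
```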
Result Explained
The result of the corr() method is a table of numbers
that represent how strong the relationship is between each pair of
columns. The number varies from -1 to 1.
1 means that there is a perfect correlation: each time a value
went up in the first column, the other one went up as well,
in exact proportion.
0.9 is also a strong relationship: if you increase one value, the
other will probably increase as well.
-0.9 is just as strong a relationship as 0.9, but if you
increase one value, the other will probably go down.
0.2 means NOT a strong relationship: if one value
goes up, that does not mean that the other will.
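The sign of the number can be checked with two made-up columns, one that rises with x and one that falls as x rises. This sketch uses the corr() method of a Series, which compares it with one other Series.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y_up = 2 * x + 1       # rises whenever x rises
y_down = 10 - 3 * x    # falls whenever x rises

corr_up = x.corr(y_up)       # close to 1
corr_down = x.corr(y_down)   # close to -1
print(corr_up, corr_down)
```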
Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000,
which makes sense: each column always has a perfect relationship with
itself.
Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more
calories you burn, and the other way around: if you burned a lot of
calories, you probably had a long workout.
Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very
bad correlation, meaning that we cannot predict the max pulse by just
looking at the duration of the workout, and vice versa.