DS (Pandas)

Uploaded by

deepti.u.1228

Pandas

Pandas is a Python library that is used to work with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.

Pandas allows us to analyze big data and draw conclusions based on
statistical theories. Pandas can clean messy data sets and make them
readable and relevant. Relevant data is very important in data science.

What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

output:
0 1
1 7
2 2
dtype: int64

Labels
If nothing else is specified, the values are labeled with their index number.
The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

print(myvar[0])

Create Labels
With the index argument, you can name your own labels.

import pandas as pd
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

output:

x 1
y 7
z 2
dtype: int64
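
Once labels are set, a value can be accessed by its label. The following is a minimal sketch that reuses the Series above; .loc and .iloc are the explicit label-based and position-based accessors:

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])

# Access a value by its label
by_label = myvar["y"]

# .loc selects by label, .iloc selects by position
by_loc = myvar.loc["y"]
by_position = myvar.iloc[1]

print(by_label, by_loc, by_position)
```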

Key/Value Objects as Series


You can also use a key/value object, like a dictionary, when creating a Series.

The keys of the dictionary become the labels.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

output:

day1 420
day2 380
day3 390
dtype: int64
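
To include only some of the items in the dictionary, the index argument can list the keys to keep. A short sketch using the same calories dictionary:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys listed in index are included in the Series
myvar = pd.Series(calories, index=["day1", "day2"])

print(myvar)
```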

DataFrames
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional
array, or a table with rows and columns. Data sets in Pandas are usually
multi-dimensional tables, called DataFrames. A Series is like a column; a
DataFrame is the whole table.

import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)

output:

   calories  duration
0       420        50
1       380        40
2       390        45

Locate Row
Pandas uses the loc attribute to return one or more specified rows.

# refer to the row index:

print(df.loc[0])
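
The result of loc depends on what is passed in: a single index returns the row as a Series, while a list of indexes returns a DataFrame. A small sketch, reusing the calories/duration data from the DataFrame example:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

# A single label returns one row as a Series
row = df.loc[0]

# A list of labels returns the selected rows as a DataFrame
rows = df.loc[[0, 1]]

print(row)
print(rows)
```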

Read CSV Files


A simple way to store big data sets is to use CSV files (comma-separated
values files). CSV files contain plain text and are a well-known format that
can be read by everyone, including Pandas.

C:\Users\LUV\Downloads\data.csv

Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string()) #use to_string() to print the entire DataFrame.
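
When no CSV file is at hand, read_csv() can be tried out on an in-memory text buffer. The sketch below assumes a tiny made-up sample (csv_text is a hypothetical stand-in, not the data.csv used in these notes):

```python
import io
import pandas as pd

# Hypothetical sample data, standing in for a CSV file on disk
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,282.4
"""

# read_csv() accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.to_string())
```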

If you have a large DataFrame with many rows, Pandas will only return the
first 5 rows, and the last 5 rows:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with
the pd.options.display.max_rows statement.

Check the number of maximum returned rows:

import pandas as pd

print(pd.options.display.max_rows)

On my system the number is 60, which means that if the DataFrame contains
more than 60 rows, the print(df) statement will return only the headers and
the first and last 5 rows.

You can change the maximum rows number with the same statement.

Increase the maximum number of rows to display the entire DataFrame:

import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df)
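
Changing pd.options.display.max_rows affects every later print. If the higher limit is only needed once, pd.option_context() can apply it temporarily. A sketch, using a generated DataFrame as a stand-in for data.csv:

```python
import pandas as pd

# Generated stand-in for a large data set (100 rows)
df = pd.DataFrame({"value": range(100)})

default_max_rows = pd.options.display.max_rows

# The setting applies only inside the with block
with pd.option_context("display.max_rows", 9999):
    full_text = repr(df)

# Outside the block the previous setting is restored
truncated_text = repr(df)

print(len(full_text.splitlines()), len(truncated_text.splitlines()))
```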

Analyzing DataFrames
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame
is the head() method.

The head() method returns the headers and a specified number of rows,
starting from the top.

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10)) # print the first 10 rows of the DataFrame

If the number of rows is not specified, the head() method returns the first
5 rows.

Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

The tail() method returns the headers and a specified number of rows,
starting from the bottom.

Print the last 5 rows of the DataFrame:

print(df.tail())
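
Both methods can be tried without a CSV file. The sketch below uses a generated DataFrame (a stand-in for data.csv) to show how the row count argument behaves:

```python
import pandas as pd

# Generated stand-in for data.csv: 20 rows, one column
df = pd.DataFrame({"value": range(20)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # 5 rows when no number is given

print(first_three)
print(last_two)
```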

Pandas - Cleaning Data


Data Cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

    Duration          Date  Pulse  Maxpulse  Calories
 0        60  '2020/12/01'    110       130     409.1
 1        60  '2020/12/02'    117       145     479.0
 2        60  '2020/12/03'    103       135     340.0
 3        45  '2020/12/04'    109       175     282.4
 4        45  '2020/12/05'    117       148     406.0
 5        60  '2020/12/06'    102       127     300.0
 6        60  '2020/12/07'    110       136     374.0
 7       450  '2020/12/08'    104       134     253.3
 8        30  '2020/12/09'    109       133     195.1
 9        60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'    100       120     300.0
18        45  '2020/12/18'     90       112       NaN
19        60  '2020/12/19'    103       123     323.0
20        45  '2020/12/20'     97       125     243.0
21        60  '2020/12/21'    108       131     364.2
22        45           NaN    100       119     282.0
23        60  '2020/12/23'    130       101     300.0
24        45  '2020/12/24'    105       132     246.0
25        60  '2020/12/25'    102       126     334.5
26        60    2020/12/26    100       120     250.0
27        60  '2020/12/27'     92       118     241.0
28        60  '2020/12/28'    103       132       NaN
29        60  '2020/12/29'    100       132     280.0
30        60  '2020/12/30'    102       129     380.3
31        60  '2020/12/31'     92       115     243.0

The data set contains some empty cells ("Date" in row 22, and "Calories" in
rows 18 and 28).

The data set contains data in the wrong format ("Date" in row 26).

The data set contains wrong data ("Duration" in row 7).

The data set contains duplicates (rows 11 and 12).

Empty cells can potentially give you a wrong result when you analyze
data.
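
Before cleaning, it helps to know where the empty cells are. One way, sketched here on a small made-up sample rather than the full table above, is to combine isna() with sum():

```python
import numpy as np
import pandas as pd

# Small made-up sample with empty cells (NaN / None)
df = pd.DataFrame({
    "Duration": [60, 45, 45, 60],
    "Date": ["'2020/12/17'", "'2020/12/18'", None, "'2020/12/28'"],
    "Calories": [300.0, np.nan, 282.0, np.nan],
})

# isna() marks empty cells with True; sum() counts them per column
empty_per_column = df.isna().sum()

print(empty_per_column)
```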

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows
will not have a big impact on the result.

Return a new DataFrame with no empty cells:

By default, the dropna() method returns a new DataFrame, and will
not change the original.

import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())

If you want to change the original DataFrame, use the inplace = True
argument:

import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True) # remove all rows that contain a null value

print(df.to_string())
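
dropna() also accepts a subset argument, so that rows are removed only when specific columns are empty. A sketch on a small made-up sample (not the full data set):

```python
import numpy as np
import pandas as pd

# Small made-up sample with empty cells
df = pd.DataFrame({
    "Duration": [60, 45, 45, 60],
    "Date": ["'2020/12/17'", "'2020/12/18'", None, "'2020/12/28'"],
    "Calories": [300.0, np.nan, 282.0, np.nan],
})

# Drop rows with an empty cell in any column
no_empty = df.dropna()

# Drop rows only when "Date" is empty
dated = df.dropna(subset=["Date"])

print(no_empty.to_string())
print(dated.to_string())
```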

Replace Empty Values


Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty
cells.

The fillna() method allows us to replace empty cells with a value:

import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True) # Replace null values with the number 130.

To only replace empty values for one column, specify the column name for
the DataFrame.

import pandas as pd

df = pd.read_csv('data.csv')
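# Replace the null values in the "Calories" column with the number 130.
# The result is assigned back, because calling fillna() with inplace = True
# on a selected column may leave df unchanged in recent pandas versions.
df["Calories"] = df["Calories"].fillna(130)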

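# Replace all null values and save the result back to the CSV file:

import pandas as pd
df = pd.read_csv(r'C:\Users\LUV\OneDrive\Desktop\data.csv')
df.fillna(130, inplace = True)
print(df.to_string())
# index = False keeps the row index from being written as an extra column
df.to_csv(r'C:\Users\LUV\OneDrive\Desktop\data.csv', index = False)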
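
fillna() also accepts a dictionary that maps column names to replacement values, which fills several columns in one call. The replacement values below (130 and the placeholder date) are arbitrary choices for illustration, and the sample data is made up:

```python
import numpy as np
import pandas as pd

# Small made-up sample with empty cells in two columns
df = pd.DataFrame({
    "Date": ["'2020/12/18'", None, "'2020/12/28'"],
    "Calories": [np.nan, 282.0, np.nan],
})

# Each column gets its own replacement value; df itself is not changed
filled = df.fillna({"Calories": 130, "Date": "'2020/12/22'"})

print(filled.to_string())
```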

Replace Using Mean, Median, or Mode


A common way to replace empty cells is to calculate the mean, median, or
mode value of the column.

Pandas uses the mean(), median(), and mode() methods to calculate the
respective values for a specified column.

#Calculate the MEAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mean()

df["Calories"] = df["Calories"].fillna(x) # assign the filled column back

#Calculate the MEDIAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].median()

df["Calories"] = df["Calories"].fillna(x) # assign the filled column back

#Calculate the MODE, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x) # assign the filled column back
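
The three statistics differ in what they measure: the mean is the average, the median is the middle value after sorting, and the mode is the most frequent value. mode() returns a Series because several values can be tied, which is why [0] is used to take the first one. A sketch on made-up numbers:

```python
import pandas as pd

# Made-up values; 300 appears twice, so it is the mode
calories = pd.Series([300.0, 250.0, 300.0, 410.0, None])

mean_value = calories.mean()        # average of the non-empty values
median_value = calories.median()    # middle value after sorting
mode_value = calories.mode()[0]     # most frequent value

print(mean_value, median_value, mode_value)
```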


Pandas - Cleaning Data of Wrong Format
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to
analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the
columns into the same format.

Let's try to convert all cells in the 'Date' column into dates.

Pandas has a to_datetime() method for this:

import pandas as pd

df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())
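
Values that cannot be converted can be turned into NaT (the pandas marker for a missing date) by passing errors = 'coerce', and the affected rows can then be removed with dropna(). This is a sketch on made-up data; the format string is an assumption about how the dates are written:

```python
import pandas as pd

# Made-up sample: two valid dates, one empty cell, one unparseable value
df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", None, "unknown"],
    "Pulse": [110, 117, 100, 103],
})

# Values that do not match the format become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# Remove the rows where the date is missing or could not be parsed
clean = df.dropna(subset=["Date"])

print(clean.to_string())
```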

Pandas - Removing Duplicates


By taking a look at our test data set, we can assume that rows 11 and 12 are
duplicates.

To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row:

print(df.duplicated()) # Returns True for every row that is a duplicate, otherwise False.

Removing Duplicates
To remove duplicates, use the drop_duplicates() method.

df.drop_duplicates(inplace = True) #Remove all duplicates.
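
The two methods can be seen together on a small made-up sample. By default, the first occurrence of a repeated row is kept and only the later copies are flagged and removed:

```python
import pandas as pd

# Made-up sample: row 1 repeats row 0
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45],
    "Pulse": [100, 100, 106, 90],
})

# True marks a row that repeats an earlier row
flags = df.duplicated()

# By default the first occurrence is kept
unique_rows = df.drop_duplicates()

print(flags.tolist())
print(unique_rows)
```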

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
    'Gender': ['M', 'M', 'M', 'F', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}

df = pd.DataFrame(data)

# replace F with M in row 3
df.loc[3, 'Gender'] = 'M'

print(df)

Replace Values Based on a Condition

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
    'Gender': ['M', 'M', 'M', 'M', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}

df = pd.DataFrame(data)

# replace values based on conditions:
# ages above 14 that are multiples of 10 are divided by 10
for i in df.index:
    age_val = df.loc[i, 'Age']
    if (age_val > 14) and (age_val % 10 == 0):
        # integer division keeps the ages as whole numbers
        df.loc[i, 'Age'] = age_val // 10

print(df)
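
The same rule can be applied without a loop by building a Boolean mask and passing it to loc. This is a sketch of that alternative, not the approach used in the example above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Michael", "Tom", "Alex", "Ryan"],
    "Age": [8, 9, 7, 80, 100],
})

# True where the age is above 14 and a multiple of 10
mask = (df["Age"] > 14) & (df["Age"] % 10 == 0)

# Divide only the selected ages by 10 (integer division keeps whole numbers)
df.loc[mask, "Age"] = df.loc[mask, "Age"] // 10

print(df)
```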

Pandas - Data Correlations


Finding Relationships
A great aspect of the Pandas module is the corr() method.

The corr() method calculates the relationship between each column in your
data set.

The examples on this page use a CSV file called: 'data.csv'.

df.corr() #Show the relationship between the columns.

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000

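The corr() method only works on numeric columns. In pandas 2.0 and later,
non-numeric columns are no longer skipped automatically; pass
numeric_only = True to ignore them.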

Result Explained

The result of the corr() method is a table of numbers that represent how
strong the relationship is between two columns.

The number varies from -1 to 1.

1 means that there is a 1 to 1 relationship (a perfect correlation), and for this
data set, each time a value went up in the first column, the other one went
up as well.

0.9 is also a good relationship, and if you increase one value, the other will
probably increase as well.

-0.9 would be just as good a relationship as 0.9, but if you increase one value,
the other will probably go down.

0.2 means NOT a good relationship, meaning that if one value goes up, it does
not mean that the other will.

Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which
makes sense: each column always has a perfect relationship with itself.

Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more
calories you burn, and the other way around: if you burned a lot of calories,
you probably had a long workout.

Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad
correlation, meaning that we cannot predict the max pulse by just looking at
the duration of the workout, and vice versa.
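
The number that corr() reports by default is the Pearson correlation coefficient. As a sketch on made-up numbers (not the values from data.csv), the result for one pair of columns can be reproduced from the covariance and the standard deviations:

```python
import pandas as pd

# Made-up sample values for illustration
df = pd.DataFrame({
    "Duration": [30, 45, 60, 60, 90],
    "Calories": [195.1, 282.4, 409.1, 340.0, 560.0],
})

# Correlation between two columns, as computed by pandas
r_pandas = df["Duration"].corr(df["Calories"])

# The same value from its definition: cov(x, y) / (std(x) * std(y))
cov_xy = df["Duration"].cov(df["Calories"])
r_manual = cov_xy / (df["Duration"].std() * df["Calories"].std())

print(round(r_pandas, 6), round(r_manual, 6))
```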

What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a
visualization utility.

Matplotlib was created by John D. Hunter.

Matplotlib is open source and we can use it freely.

Matplotlib is mostly written in Python; a few segments are written in C,
Objective-C, and JavaScript for platform compatibility.

Pandas - Plotting
Plotting
Pandas uses the plot() method to create diagrams.

We can use Pyplot, a submodule of the Matplotlib library, to visualize the
diagram on the screen.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df.plot()
plt.show()

Scatter Plot
Specify that you want a scatter plot with the kind argument:

kind = 'scatter'

A scatter plot needs an x- and a y-axis.

In the example below we will use "Duration" for the x-axis and "Calories" for
the y-axis.

Include the x and y arguments like this:

x = 'Duration', y = 'Calories'

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()

In the previous example, we learned that the correlation between "Duration"
and "Calories" was 0.922721, and we concluded that a higher
duration means more calories burned.

Looking at the scatter plot, I agree.

Let's create another scatter plot, where there is a bad relationship between
the columns, like "Duration" and "Maxpulse", with the correlation 0.009403:

# A scatter plot where there is no relationship between the columns:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')


plt.show()
