DS (Pandas)

Uploaded by

deepti.u.1228

Pandas

Pandas is a Python library that is used to work with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.

Pandas allows us to analyze big data and draw conclusions based on
statistical theories. Pandas can clean messy data sets and make them
readable and relevant. Relevant data is very important in data science.

What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

output:
0    1
1    7
2    2
dtype: int64

Labels
If nothing else is specified, the values are labeled with their index number.
The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

print(myvar[0])

Create Labels
With the index argument, you can name your own labels.

import pandas as pd
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

output:

x    1
y    7
z    2
dtype: int64
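
Once custom labels exist, they can be used to look values up. The following is a minimal sketch of that idea, reusing the same values and labels as the example above; passing a list of labels to select several values is an addition not shown in the original notes.

```python
import pandas as pd

# Same Series as above, with custom labels instead of 0, 1, 2.
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

# Access one value by its label.
print(myvar["y"])

# Access several values at once by passing a list of labels.
print(myvar[["x", "z"]])
```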

Key/Value Objects as Series

You can also use a key/value object, like a dictionary when creating a Series.

The keys of the dictionary become the labels.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

output:

day1    420
day2    380
day3    390
dtype: int64

DataFrames
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional
array, or a table with rows and columns. Data sets in Pandas are usually
multi-dimensional tables, called DataFrames. If a Series is like a column, a
DataFrame is the whole table.

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

output:

   calories  duration
0       420        50
1       380        40
2       390        45

Locate Row
Pandas uses the loc attribute to return one or more specified row(s).

#refer to the row index:

print(df.loc[0])
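
The loc attribute also accepts a list of labels, and it works with named labels as well as numbers. The sketch below assumes the same calories/duration data as above; the labels "day1" to "day3" are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Hypothetical row labels, used only for this illustration.
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A single label returns that row as a Series.
row = df.loc["day2"]
print(row)

# A list of labels returns the matching rows as a DataFrame.
rows = df.loc[["day1", "day3"]]
print(rows)
```

Note that a single label gives back a Series, while a list of labels gives back a DataFrame, even when the list holds only one label.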

Read CSV Files

A simple way to store big data sets is to use CSV files (comma separated
files). CSV files contain plain text and are a well-known format that can be
read by almost any program, including Pandas.

C:\Users\LUV\Downloads\data.csv

Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string()) #use to_string() to print the entire DataFrame.

If you have a large DataFrame with many rows, print(df) will only display the
first 5 rows and the last 5 rows:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)

max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with
the pd.options.display.max_rows statement.

Check the number of maximum returned rows:

import pandas as pd

print(pd.options.display.max_rows)

On my system the number is 60, which means that if the DataFrame contains
more than 60 rows, the print(df) statement will display only the headers and
the first and last 5 rows.

You can change the maximum rows number with the same statement.

Increase the maximum number of rows to display the entire DataFrame:

import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df)
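
Setting pd.options.display.max_rows changes the limit for the rest of the program. If the change should apply only to one print, pd.option_context can limit it to a block. This is a sketch under the assumption that a generated 100-row DataFrame with a hypothetical "value" column stands in for data.csv.

```python
import pandas as pd

# A generated 100-row DataFrame stands in for data.csv here.
df = pd.DataFrame({"value": range(100)})

# Truncated view: more rows than max_rows, so only the head and tail appear.
with pd.option_context("display.max_rows", 10):
    truncated = repr(df)

# Full view: max_rows=None removes the limit inside the block only.
with pd.option_context("display.max_rows", None):
    full = repr(df)

print(len(truncated.splitlines()), "lines when truncated")
print(len(full.splitlines()), "lines when shown in full")
```

Outside the with blocks the previous max_rows setting is restored automatically.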

Analyzing DataFrames
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame
is the head() method.

The head() method returns the headers and a specified number of rows,
starting from the top.

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10)) # print the first 10 rows of the DataFrame

If the number of rows is not specified, the head() method returns the first
5 rows.

Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

The tail() method returns the headers and a specified number of rows,
starting from the bottom.

Print the last 5 rows of the DataFrame:

print(df.tail())
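
The examples above need data.csv on disk. To try head() and tail() without the file, a small generated DataFrame works just as well. This is only an illustrative sketch; the "value" column is a made-up name.

```python
import pandas as pd

# Generated stand-in for data.csv: 20 rows with a hypothetical "value" column.
df = pd.DataFrame({"value": range(20)})

first_three = df.head(3)  # rows with index 0, 1, 2
last_two = df.tail(2)     # rows with index 18, 19

print(first_three)
print(last_two)
```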

Pandas - Cleaning Data

Data Cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

    Duration          Date  Pulse  Maxpulse  Calories
 0        60  '2020/12/01'    110       130     409.1
 1        60  '2020/12/02'    117       145     479.0
 2        60  '2020/12/03'    103       135     340.0
 3        45  '2020/12/04'    109       175     282.4
 4        45  '2020/12/05'    117       148     406.0
 5        60  '2020/12/06'    102       127     300.0
 6        60  '2020/12/07'    110       136     374.0
 7       450  '2020/12/08'    104       134     253.3
 8        30  '2020/12/09'    109       133     195.1
 9        60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'    100       120     300.0
18        45  '2020/12/18'     90       112       NaN
19        60  '2020/12/19'    103       123     323.0
20        45  '2020/12/20'     97       125     243.0
21        60  '2020/12/21'    108       131     364.2
22        45           NaN    100       119     282.0
23        60  '2020/12/23'    130       101     300.0
24        45  '2020/12/24'    105       132     246.0
25        60  '2020/12/25'    102       126     334.5
26        60    2020/12/26    100       120     250.0
27        60  '2020/12/27'     92       118     241.0
28        60  '2020/12/28'    103       132       NaN
29        60  '2020/12/29'    100       132     280.0
30        60  '2020/12/30'    102       129     380.3
31        60  '2020/12/31'     92       115     243.0

The data set contains some empty cells ("Date" in row 22, and "Calories" in
rows 18 and 28).

The data set contains a value in the wrong format ("Date" in row 26).

The data set contains wrong data ("Duration" in row 7).

The data set contains duplicates (rows 11 and 12).

Empty cells can potentially give you a wrong result when you analyze
data.
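
Before cleaning, it can help to count how many empty cells each column has. The sketch below uses isna() for that; the four rows are modeled on the table above rather than read from data.csv, and None marks an empty cell.

```python
import pandas as pd

# A few rows modeled on the table above; None marks an empty cell.
df = pd.DataFrame({
    "Duration": [60, 45, 45, 60],
    "Date": ["2020/12/17", "2020/12/18", None, "2020/12/28"],
    "Calories": [300.0, None, 282.0, None],
})

# isna() marks empty cells with True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```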

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows
will not have a big impact on the result.

Return a new DataFrame with no empty cells:

By default, the dropna() method returns a new DataFrame and will not change
the original.

import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())

If you want to change the original DataFrame, use the inplace = True
argument:

import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string()) # Removes all the rows with null value.
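
Removing every row with any empty cell can discard more data than necessary. As a sketch of a narrower approach, dropna() takes a subset argument that restricts the check to the listed columns. The sample rows below are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 45, 60],
    "Date": ["2020/12/17", "2020/12/18", None, "2020/12/28"],
    "Calories": [300.0, None, 282.0, 280.0],
})

# Drop every row that has at least one empty cell.
no_empty = df.dropna()

# Drop only the rows where "Calories" is empty; an empty "Date" is tolerated.
calories_known = df.dropna(subset=["Calories"])

print(len(df), len(no_empty), len(calories_known))
```

Both calls return new DataFrames, so df itself still has all four rows.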

Replace Empty Values

Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty
cells.

The fillna() method allows us to replace empty cells with a value:

import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True) # Replace null values with the number 130.

To only replace empty values for one column, specify the column name for
the DataFrame.

import pandas as pd

df = pd.read_csv('data.csv')
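# Replace the null values in the "Calories" column with the number 130.
# Assigning the result back avoids calling fillna(inplace = True) on a
# selected column, which newer pandas versions warn about.
df["Calories"] = df["Calories"].fillna(130)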

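import pandas as pd

# Read the file, fill the empty cells, and write the cleaned data back.
df = pd.read_csv(r'C:\Users\LUV\OneDrive\Desktop\data.csv')
df.fillna(130, inplace = True)
print(df.to_string())
# index=False keeps the row index from being written as an extra column.
df.to_csv(r'C:\Users\LUV\OneDrive\Desktop\data.csv', index=False)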
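
When different columns need different replacement values, fillna() also accepts a dictionary that maps column names to values. A minimal sketch, with made-up sample rows and arbitrarily chosen fill values:

```python
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110.0, None, 103.0],
    "Calories": [409.1, 479.0, None],
})

# Map each column name to its own replacement value (values chosen arbitrarily).
filled = df.fillna({"Pulse": 100, "Calories": 130})
print(filled)
```

Without inplace = True, fillna() returns a new DataFrame and leaves df unchanged.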

Replace Using Mean, Median, or Mode

A common way to replace empty cells is to calculate the mean, median or
mode value of the column.

Pandas uses the mean(), median() and mode() methods to calculate the
respective values for a specified column.

#Calculate the MEAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mean()

df["Calories"] = df["Calories"].fillna(x) # assign the filled column back

#Calculate the MEDIAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].median()

df["Calories"] = df["Calories"].fillna(x) # assign the filled column back

#Calculate the MODE, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mode()[0]

df["Calories"] = df["Calories"].fillna(x) # assign the filled column back
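
To see how the three statistics differ, here is a small worked sketch on made-up values (not taken from data.csv). The mean is the average of the non-empty values, the median is the middle value after sorting, and the mode is the value that appears most often.

```python
import pandas as pd

# Made-up calorie values with one empty cell.
calories = pd.Series([300.0, 250.0, None, 250.0, 400.0])

mean_value = calories.mean()      # (300 + 250 + 250 + 400) / 4
median_value = calories.median()  # middle of 250, 250, 300, 400
mode_value = calories.mode()[0]   # 250 appears twice

print(mean_value, median_value, mode_value)

# Fill the empty cell with the median, as one possible choice.
filled_with_median = calories.fillna(median_value)
print(filled_with_median)
```

All three methods skip empty cells by default, so the empty cell does not affect the result.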

Pandas - Cleaning Data of Wrong Format
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to
analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the
columns into the same format.

Let's try to convert all cells in the 'Date' column into dates.

Pandas has a to_datetime() method for this:

import pandas as pd

df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())
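
The text above mentions two options: converting the cells or removing the rows. The sketch below combines them on made-up sample rows. It assumes the expected format is year/month/day and uses errors="coerce" so that cells that cannot be converted become empty (NaT) instead of raising an error; those rows are then removed with dropna().

```python
import pandas as pd

# Made-up rows: two valid dates, one empty cell, one unusable value.
df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", None, "not a date"],
    "Pulse": [110, 117, 100, 98],
})

# Convert the column; values that do not match the format become NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# Remove the rows whose date could not be converted.
cleaned = df.dropna(subset=["Date"])

print(cleaned.to_string())
```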

Pandas - Removing Duplicates

By taking a look at our test data set, we can assume that rows 11 and 12 are
duplicates.

To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row:

print(df.duplicated()) # Returns True for every row that is a duplicate, otherwise False.

Removing Duplicates
To remove duplicates, use the drop_duplicates() method.

df.drop_duplicates(inplace = True) #Remove all duplicates.
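
The two methods can be tried together on a small example. This sketch uses made-up rows in which the second and third rows are identical. By default the first occurrence counts as the original, so only the later copy is flagged and removed.

```python
import pandas as pd

# Made-up rows; the rows with index 1 and 2 are identical.
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45],
    "Pulse": [103, 100, 100, 90],
    "Calories": [329.3, 250.7, 250.7, 260.0],
})

# True marks a row that repeats an earlier row.
flags = df.duplicated()
print(flags)

# Returns a new DataFrame without the repeated row.
deduplicated = df.drop_duplicates()
print(deduplicated)
```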

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
    'Gender': ['M', 'M', 'M', 'F', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}

df = pd.DataFrame(data)

# replace F with M in the row with index 3
df.loc[3, 'Gender'] = 'M'

print(df)

Replace Values Based on a Condition

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
    'Gender': ['M', 'M', 'M', 'M', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}

df = pd.DataFrame(data)

# replace values based on conditions
for i in df.index:
    age_val = df.loc[i, 'Age']
    # ages above 14 that are multiples of 10 are divided by 10 (80 -> 8, 100 -> 10)
    if (age_val > 14) and (age_val % 10 == 0):
        # integer division keeps the column as whole numbers
        df.loc[i, 'Age'] = age_val // 10

print(df)
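
The loop above checks one row at a time. The same rule can also be written without a loop by building a Boolean mask and passing it to loc. This is an alternative sketch, not the method used in the notes; it keeps only the two columns needed for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
})

# Boolean mask: True for ages above 14 that are multiples of 10.
mask = (df['Age'] > 14) & (df['Age'] % 10 == 0)

# Update only the masked rows; integer division keeps the column as integers.
df.loc[mask, 'Age'] = df.loc[mask, 'Age'] // 10

print(df)
```

On large DataFrames the masked version is usually faster than a row-by-row loop, because the comparison and the update run on whole columns.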

Pandas - Data Correlations

Finding Relationships
A great aspect of the Pandas module is the corr() method.

The corr() method calculates the relationship between each column in your
data set.

The examples on this page use a CSV file called 'data.csv'.

df.corr(numeric_only = True) # Show the relationship between the numeric columns.

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000

With numeric_only = True, the corr() method ignores "not numeric" columns.
Since pandas 2.0, calling corr() without it raises an error if the DataFrame
contains a column that cannot be converted to numbers.

Result Explained

The result of the corr() method is a table of numbers that represent how
strong the relationship is between each pair of columns.

The number varies from -1 to 1.

1 means that there is a 1 to 1 relationship (a perfect correlation): for this
data set, each time a value went up in the first column, the other one went
up as well.

0.9 is also a good relationship: if you increase one value, the other will
probably increase as well.

-0.9 would be just as good a relationship as 0.9, but if you increase one
value, the other will probably go down.

0.2 means NOT a good relationship: if one value goes up, that does not mean
that the other will.

Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which
makes sense: each column always has a perfect relationship with itself.

Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more
calories you burn, and the other way around: if you burned a lot of calories,
you probably had a long workout.

Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad
correlation, meaning that we cannot predict the max pulse by just looking at
the duration of the workout, and vice versa.
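
The two extremes of the scale can be checked on a tiny made-up data set. In this sketch the column names are hypothetical: "double_x" always rises with "x", and "reversed_x" always falls as "x" rises. The text column is skipped through numeric_only = True.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a", "b", "c", "d"],  # not numeric, skipped by numeric_only
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],       # rises whenever x rises
    "reversed_x": [4, 3, 2, 1],     # falls whenever x rises
})

corr = df.corr(numeric_only=True)
print(corr)

# Look up a single pair of columns in the result table.
print(corr.loc["x", "double_x"])
print(corr.loc["x", "reversed_x"])
```

The first lookup is a perfect positive correlation and the second a perfect negative one, matching the 1 and -1 ends of the range described above.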

What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a
visualization utility.

Matplotlib was created by John D. Hunter.

Matplotlib is open source and we can use it freely.

Matplotlib is mostly written in Python; a few segments are written in C,
Objective-C and JavaScript for platform compatibility.

Pandas - Plotting
Plotting
Pandas uses the plot() method to create diagrams.

We can use Pyplot, a submodule of the Matplotlib library, to visualize the
diagram on the screen.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df.plot()
plt.show()

Scatter Plot
Specify that you want a scatter plot with the kind argument:

kind = 'scatter'

A scatter plot needs an x- and a y-axis.

In the example below we will use "Duration" for the x-axis and "Calories" for
the y-axis.

Include the x and y arguments like this:

x = 'Duration', y = 'Calories'

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()

In the previous example, we learned that the correlation between "Duration"
and "Calories" was 0.922721, and we concluded that a higher duration means
more calories burned.

Looking at the scatterplot, I agree.

Let's create another scatterplot, where there is a bad relationship between
the columns, like "Duration" and "Maxpulse", with the correlation 0.009403:

# A scatterplot where there is no relationship between the columns:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')

plt.show()
