Parul Institute of Computer Application
Faculty Of IT and Computer Science
PARUL UNIVERSITY
Python Lab
Pandas
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and
manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based
on statistical theories.
Pandas can clean messy data sets, and make them readable and
relevant.
Relevant data is very important in data science.
Data Science is a branch of computer science where we study how
to store, use and analyze data in order to derive information from it.
What Can Pandas Do?
Pandas gives you answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
Python AI-IMCA SEM-2 Prof Nirmit Shah 1
• What is the maximum value?
• What is the minimum value?
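As a rough sketch of how these questions turn into code, the example below builds a small DataFrame in memory and asks each one. The column names and values are invented for illustration and are not taken from data.csv.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 300, 400, 600],
})

# Is there a correlation between two columns? (Pearson correlation by default)
correlation = df["Duration"].corr(df["Calories"])

# What are the average, maximum and minimum values of one column?
average = df["Calories"].mean()
maximum = df["Calories"].max()
minimum = df["Calories"].min()

print("correlation:", correlation)
print("average:", average)
print("max:", maximum)
print("min:", minimum)
```

A correlation close to 1 means the two columns rise together; a value close to 0 means there is no linear relationship.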
Pandas is also able to delete rows that are not relevant or that
contain wrong values, such as empty or NULL values. This is
called cleaning the data.
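One possible way to remove rows with empty values is the dropna() method. The sketch below uses a tiny made-up DataFrame (the names and numbers are assumptions for illustration) in which one salary is missing.

```python
import pandas as pd

# Hypothetical sample data; None is stored as an empty (NaN) value.
df = pd.DataFrame({
    "person": ["A", "B", "C"],
    "salary": [40000, None, 45000],
})

# dropna() returns a new DataFrame without the rows that contain empty values.
# The original df is left unchanged.
cleaned = df.dropna()

print(cleaned)
```

Because dropna() returns a new DataFrame, the result has to be assigned to a variable if the cleaned data is needed later.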
Where is the Pandas Codebase?
The source code for Pandas is located at this GitHub
repository: https://github.com/pandas-dev/pandas
pip install pandas
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated values files).
CSV files contain plain text and use a well-known format that can be read by almost any
program, including Pandas.
In our examples we will be using a CSV file called 'data.csv'.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Tip: use to_string() to print the entire DataFrame.
If you print a large DataFrame with many rows without to_string(), Pandas will only show
the first 5 rows and the last 5 rows.
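The difference can be seen in a small sketch. It assumes the default Pandas display options and, instead of data.csv, reads 100 rows of made-up CSV text from memory (the column names id and value are hypothetical).

```python
import io
import pandas as pd

# Build 100 rows of hypothetical CSV text in memory instead of reading data.csv.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(100))
df = pd.read_csv(io.StringIO(csv_text))

short_view = repr(df)        # the text that print(df) would show
full_view = df.to_string()   # every row, regardless of the display options

print(short_view)
print("lines in short view:", len(short_view.splitlines()))
print("lines in full view:", len(full_view.splitlines()))
```

The short view ends with a line that reports the real size of the DataFrame, so no data is lost; it is only hidden from the printout.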
max_rows
The number of rows returned is defined in Pandas option settings.
You can check the current maximum number of rows with
the pd.options.display.max_rows option.
Example
Check the number of maximum returned rows:
import pandas as pd
print(pd.options.display.max_rows)
Example
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
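Assigning to pd.options.display.max_rows changes the setting for the rest of the program. If the larger limit is needed only once, one alternative is pd.option_context, which restores the old value afterwards. The following is a minimal sketch with a made-up 100-row DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, built in memory.
df = pd.DataFrame({"value": range(100)})

before = pd.options.display.max_rows

# The option is raised only inside the with-block.
with pd.option_context("display.max_rows", 9999):
    inside = pd.options.display.max_rows
    print(df)  # all 100 rows are printed here

# Outside the block the previous setting is active again.
after = pd.options.display.max_rows
print(before, inside, after)
```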
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is
the head() method.
The head() method returns the headers and a specified number of rows, starting from
the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the
bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
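Both methods accept the number of rows as an argument. The sketch below tries this on a few rows of sample salary data that are read from a text string, so it runs without data.csv; the data itself is only an illustration.

```python
import io
import pandas as pd

# Sample CSV text kept in memory, used here in place of a file on disk.
csv_text = """person,salary,country
A,40000,USA
B,32000,Brazil
C,45000,Italy
D,54000,USA
E,72000,USA
F,62000,Brazil
"""
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)   # rows with index 0 and 1
last_two = df.tail(2)    # rows with index 4 and 5

print(first_two)
print(last_two)
```

Note that tail() keeps the original row numbers, so the last two rows are still labelled 4 and 5.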
Info About the Data
The DataFrame object has a method called info() that gives you more information
about the data set.
Example
Print information about the data:
df.info()  # info() prints its report itself and returns None, so no print() is needed
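The report lists every column with its number of non-empty values and its data type. As a small sketch, the example below uses a made-up DataFrame with one missing salary and collects the report in a text buffer through the buf parameter, so that it can be inspected as a string.

```python
import io
import pandas as pd

# Hypothetical sample data with one missing salary.
df = pd.DataFrame({
    "person": ["A", "B", "C"],
    "salary": [40000, 32000, None],
})

# Write the report into a buffer instead of printing it directly.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

print(report)
print(df.shape)   # (number of rows, number of columns)
```

A non-null count that is lower than the number of rows is a hint that the column contains empty values.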
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:
Example
Return True for every row that is a duplicate, otherwise False:
print(df.duplicated())
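A short sketch of what the result looks like, using a made-up DataFrame in which the third row repeats the first one. It also shows drop_duplicates(), one possible way to remove the repeated rows.

```python
import pandas as pd

# Hypothetical sample data: the third row is a copy of the first row.
df = pd.DataFrame({
    "person": ["A", "B", "A"],
    "salary": [40000, 32000, 40000],
})

# True marks a row that has already appeared earlier in the DataFrame.
flags = df.duplicated()

# drop_duplicates() returns a new DataFrame with the repeated rows removed.
unique_rows = df.drop_duplicates()

print(flags)
print(unique_rows)
```

Only the later copy is flagged; the first occurrence of a row is kept.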
Let's Learn Pandas with a Small Example
Create a CSV file using the following data:
   person  salary  country
0  A       40000   USA
1  B       32000   Brazil
2  C       45000   Italy
3  D       54000   USA
4  E       72000   USA
5  F       62000   Brazil
6  G       92000   Italy
7  H       55000   USA
8  I       35000   Italy
9  J       48000   Brazil
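The file can be typed by hand, but it can also be written with Pandas itself. The sketch below is one way to do it; the file name stats.csv and the temporary folder are assumptions chosen for the example, so adjust the path to your own system.

```python
import os
import tempfile
import pandas as pd

# The data from the table above.
df = pd.DataFrame({
    "person": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "salary": [40000, 32000, 45000, 54000, 72000,
               62000, 92000, 55000, 35000, 48000],
    "country": ["USA", "Brazil", "Italy", "USA", "USA",
                "Brazil", "Italy", "USA", "Italy", "Brazil"],
})

# Assumed location: a file named stats.csv in the system's temporary folder.
csv_path = os.path.join(tempfile.gettempdir(), "stats.csv")

# index=False keeps the row numbers 0..9 out of the file.
df.to_csv(csv_path, index=False)

# Read the file back to check that it was written as expected.
check = pd.read_csv(csv_path)
print(check)
```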
Practical 1: Use Pandas to Calculate Stats from an Imported CSV File
The goal is to calculate the following statistics using the Pandas
package:
• Mean salary
• Total sum of salaries
• Maximum salary
• Minimum salary
• Count of salaries
• Median salary
• Standard deviation of salaries
• Variance of salaries
Solution:
import pandas as pd
df = pd.read_csv(r'C:\Users\Ron\Desktop\stats.csv')
# block 1 - simple stats
mean1 = df['salary'].mean()
sum1 = df['salary'].sum()
max1 = df['salary'].max()
min1 = df['salary'].min()
count1 = df['salary'].count()
median1 = df['salary'].median()
std1 = df['salary'].std()
var1 = df['salary'].var()
# block 2 - group by
# select the salary column first so that only salaries are summed and counted
groupby_sum1 = df.groupby('country')['salary'].sum()
groupby_count1 = df.groupby('country')['salary'].count()
# print block 1
print('mean salary: ' + str(mean1))
print('sum of salaries: ' + str(sum1))
print('max salary: ' + str(max1))
print('min salary: ' + str(min1))
print('count of salaries: ' + str(count1))
print('median salary: ' + str(median1))
print('std of salaries: ' + str(std1))
print('var of salaries: ' + str(var1))
# print block 2
print('sum of values, grouped by the country: ' + str(groupby_sum1))
print('count of values, grouped by the country: ' + str(groupby_count1))
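The same statistics can also be collected in fewer lines with the agg() method. This is only an alternative sketch, not a replacement for the solution above; it builds the salary table in memory so that it runs without the CSV file.

```python
import pandas as pd

# The salary data from the practical, built in memory for this sketch.
df = pd.DataFrame({
    "person": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "salary": [40000, 32000, 45000, 54000, 72000,
               62000, 92000, 55000, 35000, 48000],
    "country": ["USA", "Brazil", "Italy", "USA", "USA",
                "Brazil", "Italy", "USA", "Italy", "Brazil"],
})

# agg() with a list of function names returns one value per name.
summary = df["salary"].agg(
    ["mean", "sum", "max", "min", "count", "median", "std", "var"]
)

# Sum and count of salaries for each country in one table.
by_country = df.groupby("country")["salary"].agg(["sum", "count"])

print(summary)
print(by_country)
```

For this data the mean salary is 53500, the total is 535000 and the median is 51000, which can be used to check the output of the solution.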
Pandas - Plotting
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library
to visualize the diagram on the screen.
pandas.DataFrame.plot
DataFrame.plot(*args, **kwargs)
Make plots of Series or DataFrame.
Uses the backend specified by the option plotting.backend. By default,
matplotlib is used.
Parameters:
data : Series or DataFrame
The object for which the method is called.
x : label or position, default None
Only used if data is a DataFrame.
y : label, position or list of label, positions, default None
Allows plotting of one column versus another. Only used if data is a
DataFrame.
kind : str
The kind of plot to produce:
• ‘line’ : line plot (default)
• ‘bar’ : vertical bar plot
• ‘barh’ : horizontal bar plot
• ‘hist’ : histogram
• ‘box’ : boxplot
• ‘kde’ : Kernel Density Estimation plot
• ‘density’ : same as ‘kde’
• ‘area’ : area plot
• ‘pie’ : pie plot
• ‘scatter’ : scatter plot (DataFrame only)
• ‘hexbin’ : hexbin plot (DataFrame only)
ax : matplotlib axes object, default None
An axes of the current figure.
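The parameters x, y, kind and ax can be combined as in the following sketch. The data is a made-up three-row salary table, and the Agg backend is selected only so that the example also runs on a machine without a screen; in a normal session you would call plt.show() instead of saving the figure.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data for the illustration.
df = pd.DataFrame({
    "person": ["A", "B", "C"],
    "salary": [40000, 32000, 45000],
})

# Create an empty figure with one axes object and draw the bar plot into it.
fig, ax = plt.subplots()
returned_ax = df.plot(kind="bar", x="person", y="salary", ax=ax)

# Save the picture into memory; a file name could be given here instead.
image = io.BytesIO()
fig.savefig(image, format="png")
```

Passing ax is useful when several plots should share one figure; plot() then draws into the given axes and returns it.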
Example
Import pyplot from Matplotlib and visualize our DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
Example
Draw a scatter plot with Duration on the x-axis and Calories on the y-axis:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()