Pandas

Pandas is a Python library designed for data manipulation and analysis, providing tools for cleaning, exploring, and analyzing datasets. It includes data structures like Series and DataFrames, which facilitate handling one-dimensional and two-dimensional data, respectively. Key functionalities include data cleaning, statistical analysis, and methods for sorting, ranking, and selecting data.

Uploaded by

iamjasper2024
Copyright
© All Rights Reserved

PANDAS

Pandas: Exploring Data using Series, Exploring Data using DataFrames, Index objects,
Reindex, Drop Entry, Selecting Entries, Data Alignment, Rank and Sort

21CSS303T/DS
PANDAS

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".

Pandas allows us to analyze big data and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.


What Can Pandas Do?

Pandas gives you answers about the data, such as:

Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?

Pandas is also able to delete rows that are not relevant or that contain wrong
values, such as empty or NULL values. This is called cleaning the data.


Pandas Codebase?

import pandas

mydataset = {'cars': ["BMW", "Volvo", "Ford"], 'passings': [3, 7, 2]}

myvar = pandas.DataFrame(mydataset)
print(myvar)

Pandas as pd

Pandas is usually imported under the pd alias.

Alias: in Python, an alias is an alternate name for referring to the same thing.
Create an alias with the "as" keyword while importing:

Syntax : import pandas as pd


Now the Pandas package can be referred to as pd instead of pandas.


For Checking Pandas Version

The version string is stored under the __version__ attribute.
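
A minimal sketch of the version check. The printed value depends on the installed release, so no particular output is assumed here.

```python
import pandas as pd

# __version__ is a plain string; its value depends on the installation
print(pd.__version__)
```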


Pandas Series

What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.

Example : Create a simple Pandas Series from a list - int, float, string
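
The original slide code is not available, so the following is an illustrative sketch with made-up values. It builds one Series each from a list of ints, floats, and strings.

```python
import pandas as pd

# Illustrative values; any Python list works as input
int_series = pd.Series([1, 7, 2])
float_series = pd.Series([1.5, 7.25, 2.0])
str_series = pd.Series(["BMW", "Volvo", "Ford"])

print(int_series)
print(float_series)
print(str_series)
```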


Based on the values present in the series, the datatype of the series is decided.
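
As a hedged illustration with made-up values, the dtype attribute shows which datatype Pandas inferred. The exact integer width can vary by platform, and the dtype reported for strings differs between Pandas versions, so only the numeric cases are shown.

```python
import pandas as pd

ints = pd.Series([1, 7, 2])
floats = pd.Series([1.5, 7.25, 2.0])
mixed = pd.Series([1, 2.5, 3])  # ints mixed with a float are upcast to float

print(ints.dtype)    # an integer dtype, typically int64
print(floats.dtype)  # float64
print(mixed.dtype)   # float64
```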


Labels

If nothing else is specified, the values are labeled with their index number.
The first value has index 0, the second value has index 1, and so on.
This label can be used to access a specified value.


Example : Return the second value of the Series:
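
A short sketch with illustrative values. With the default labels, label 1 and position 1 refer to the same entry.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2])

print(myvar[1])       # access by label 1 -> 7
print(myvar.iloc[1])  # access by position 1 -> 7
```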


Create your own labels
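
A minimal sketch, assuming the labels "x", "y", and "z" (chosen here for illustration). The index argument of pd.Series supplies custom labels.

```python
import pandas as pd

# index= replaces the default 0, 1, 2 labels
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])
print(myvar)
```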


Example : Return the value of "y":
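
A sketch with illustrative values: once a Series has custom labels, a value can be read by its label.

```python
import pandas as pd

myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])
print(myvar["y"])  # 7
```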


Pandas DataFrames

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or
a table with rows and columns.
In the Pandas module, DataFrame is a basic and important type.

To create a DataFrame from different sources of data or from other Python
data types, we can use the DataFrame() class.

Syntax of DataFrame() class :


DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Example : Create an Empty DataFrame

To create an empty DataFrame, pass no arguments to the pandas.DataFrame() class.
In this example, we create an empty DataFrame and print it to the console
output.
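
A minimal sketch of the empty case. The empty attribute is used here only to confirm that the result holds no data.

```python
import pandas as pd

df = pd.DataFrame()  # no arguments: no rows, no columns

print(df)
print(df.empty)  # True
```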

Example : Create a simple Pandas DataFrame
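
An illustrative sketch; the column names and values below are made up for the example.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

df = pd.DataFrame(data)  # each dictionary key becomes a column
print(df)
```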


Example : Create a simple Pandas DataFrame with Labels - Index
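
A sketch assuming hypothetical row labels "day1" to "day3". The index argument names the rows, and loc can then read a row by its label.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# index= supplies one label per row
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

print(df)
print(df.loc["day2"])  # the row labeled "day2"
```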


Create Pandas DataFrame from List of Lists

To create a Pandas DataFrame from a list of lists, pass the list of lists as the data
argument to pandas.DataFrame().
Each inner list becomes a row in the resulting DataFrame.

Example : Create DataFrame from List of Lists
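
A minimal sketch with placeholder cell values. Without further arguments, rows and columns both receive default integer labels.

```python
import pandas as pd

data = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
    ["a3", "b3", "c3"],
]

df = pd.DataFrame(data)  # each inner list becomes one row
print(df)
```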


Example : Create DataFrame from List of Lists with Column Names & Index
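
A sketch with hypothetical column names and row labels, passed through the columns and index arguments.

```python
import pandas as pd

data = [
    ["a1", "b1", "c1"],
    ["a2", "b2", "c2"],
]

df = pd.DataFrame(
    data,
    columns=["A", "B", "C"],   # one name per column
    index=["row1", "row2"],    # one label per row
)
print(df)
```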


Example : Create DataFrame from List of Lists with Different List Lengths
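
A sketch with placeholder values. When inner lists differ in length, the shorter rows are padded with missing values so that every row has the same number of columns.

```python
import pandas as pd

data = [
    ["a1", "b1"],
    ["a2", "b2", "c2"],
    ["a3"],
]

df = pd.DataFrame(data)  # width is set by the longest inner list
print(df)
print(df.isna().sum().sum())  # number of padded (missing) cells
```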


Create Pandas DataFrame from Python Dictionary

You can create a DataFrame from a dictionary by passing the dictionary as the data
argument to pandas.DataFrame().

Example : Create DataFrame from Dictionary
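
An illustrative sketch; the keys and values are made up. Dictionary keys become column names and each list becomes that column's values.

```python
import pandas as pd

marks = {
    "name": ["Asha", "Ravi", "Meena"],
    "physics": [68, 74, 77],
    "chemistry": [84, 56, 73],
}

df = pd.DataFrame(marks)
print(df)
```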


Pandas Read CSV

A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text and are a well-known format that can be read by
everyone.
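
A hedged sketch of reading CSV data with pd.read_csv. To keep the example self-contained, the CSV text (made-up values) is held in a string and wrapped in io.StringIO; with a real file, its path would be passed instead.

```python
import io
import pandas as pd

# Illustrative CSV content; a file path could be passed to read_csv instead
csv_text = (
    "Duration,Pulse,Calories\n"
    "60,110,409.1\n"
    "60,117,479.0\n"
    "45,109,282.4\n"
)

df = pd.read_csv(io.StringIO(csv_text))  # the first line is used as the header
print(df)
```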


to_string() Method :

to_string() is used to print the entire DataFrame.
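
A sketch using a made-up 100-row DataFrame. Printing a large DataFrame directly may show a shortened view, while to_string() returns every row as text.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})

text = df.to_string()  # the whole table as one string
print(text)
```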

Null Values :


Shape Attribute :
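
A minimal sketch with made-up data. shape is read without parentheses and gives a (rows, columns) tuple.

```python
import pandas as pd

df = pd.DataFrame({"Pulse": [110, 117, 103], "Calories": [409.1, 479.0, 340.0]})

print(df.shape)  # (3, 2)
```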

Viewing Data :

To see how the data looks, we can use the head() method, which shows just
the first five rows by default. If we pass a number as an argument to this method,
that many rows from the top are listed.


df.head() Method :
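
A sketch on a made-up 10-row DataFrame, showing the default and an explicit row count.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_five = df.head()    # default: first 5 rows
first_three = df.head(3)  # first 3 rows

print(first_five)
print(first_three)
```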


tail() Method :

The tail() method returns the last five rows by default.
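
A sketch on a made-up 10-row DataFrame. As with head(), a number can be passed to change how many rows are returned.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

last_five = df.tail()   # default: last 5 rows
last_two = df.tail(2)   # last 2 rows

print(last_five)
print(last_two)
```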


Names of the columns or the names of the indexes :

If we want to know the names of the columns or the names of the indexes,
we can use the DataFrame attributes columns and index respectively.
The names of the columns or indexes can be changed by assigning a new list
of the same length to these attributes.
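
A short sketch with made-up names, first reading the two attributes and then assigning new lists of matching length.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(df.columns)  # column names
print(df.index)    # row labels

# New names must match the existing number of columns and rows
df.columns = ["first", "second"]
df.index = ["r1", "r2"]
print(df)
```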


The values of any DataFrame can be retrieved as a NumPy array by reading its
values attribute.
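
A minimal sketch with made-up data. The result is a two-dimensional array without the row and column labels.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

arr = df.values  # attribute access, no parentheses
print(arr)
print(type(arr))
```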


Info About the Data:

The DataFrame object has a method called info() that gives you more
information about the data set.
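
A sketch with made-up data that includes one missing value. info() prints its summary directly (row count, column names, non-null counts, dtypes) rather than returning it.

```python
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [409.1, None, 340.0],  # None is stored as a missing value
})

df.info()  # prints the summary to the console
```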


describe() Method :

If we just want quick statistical information on all the numeric columns in a
DataFrame, we can use the describe() method.
The result shows the count, the mean, the standard deviation, the minimum and
maximum, and the percentiles (by default the 25th, 50th, and 75th) for the values
in each column or Series.
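
A sketch on a single made-up numeric column. The rows of the result are the statistics listed above.

```python
import pandas as pd

df = pd.DataFrame({"Pulse": [110, 117, 103, 109]})

stats = df.describe()
print(stats)
print(stats.loc["mean", "Pulse"])  # 109.75
```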


Selecting Data


Reindexing

An important method on pandas objects is reindex, which means to create a
new object with the data conformed to a new index.


Calling reindex on this Series rearranges the data according to the new index,
introducing missing values if any index values were not already present:
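
The Series from the original slide is not available, so this sketch uses made-up values. The label "e" does not exist in the original index, so it appears as a missing value (NaN) in the result.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Existing labels keep their values; the new label "e" gets NaN
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print(obj2)
```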


For ordered data like time series, it may be desirable to do some interpolation or
filling of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:
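
A sketch with made-up values on an index with gaps. With method="ffill", each new label takes the last known value before it.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Labels 1, 3 and 5 are new and are filled from the preceding label
filled = obj3.reindex(range(6), method="ffill")
print(filled)
```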


Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index
array or list without those entries.

The drop method returns a new object with the indicated value or values deleted
from an axis:
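
A sketch with made-up data covering both a Series and a DataFrame. For a DataFrame, the axis argument selects whether row labels or column names are dropped.

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d", "e"])
without_c = obj.drop("c")            # drop one label
without_cd = obj.drop(["d", "c"])    # drop several labels

frame = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["r1", "r2"])
no_r1 = frame.drop("r1")                   # drops a row (default axis)
no_two = frame.drop("two", axis="columns") # drops a column

print(without_c)
print(no_two)
```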


Sorting and Ranking


Sorting a dataset by some criterion is another important built-in operation. To
sort lexicographically by row or column index, use the sort_index method,
which returns a new, sorted object:
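
A sketch with made-up labels. For a DataFrame, axis=1 sorts by column names instead of row labels.

```python
import pandas as pd

obj = pd.Series([0, 1, 2, 3], index=["d", "a", "b", "c"])
sorted_obj = obj.sort_index()  # labels in order a, b, c, d

frame = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7]],
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)
by_rows = frame.sort_index()        # sort by row labels
by_cols = frame.sort_index(axis=1)  # sort by column names

print(sorted_obj)
print(by_cols)
```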


To sort a Series by its values, use its sort_values method:
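
A minimal sketch with made-up values. The original labels travel with their values, and ascending=False reverses the order.

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2])

ascending = obj.sort_values()
descending = obj.sort_values(ascending=False)

print(ascending)
print(descending)
```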


Ranking assigns ranks from one through the number of valid data points in an array.
The rank methods for Series and DataFrame are the place to look; by default rank
breaks ties by assigning each group the mean rank:
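
A sketch assuming a seven-value Series with two ties (the original slide data is not available). The two 7s would occupy ranks 6 and 7, so each receives the mean, 6.5.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = obj.rank()  # default method="average"
print(ranks)
# label 0 -> 6.5, label 2 -> 6.5 (tied 7s); label 3 -> 4.5, label 6 -> 4.5 (tied 4s)
```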


Ranks can also be assigned according to the order in which they’re observed in
the data:
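
A sketch using an assumed seven-value Series with ties. method="first" gives tied values distinct ranks in order of appearance.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks_first = obj.rank(method="first")
print(ranks_first)
# label 0 -> 6.0 and label 2 -> 7.0, rather than 6.5 for both
```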

Here, instead of using the average rank 6.5 for the entries 0 and 2, the ranks
have been set to 6 and 7 because label 0 precedes label 2 in the data.
You can rank in descending order, too:
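
A sketch on an assumed Series with ties. With ascending=False, the largest value receives rank 1; ties are still averaged by default.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks_desc = obj.rank(ascending=False)
print(ranks_desc)
# the two 7s share ranks 1 and 2 -> 1.5 each; -5 is last with rank 7.0
```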


DataFrame can compute ranks over the rows or the columns:
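
A sketch with a made-up DataFrame. By default rank() works down each column; axis="columns" ranks the values within each row instead.

```python
import pandas as pd

frame = pd.DataFrame({
    "b": [4.3, 7.0, -3.0, 2.0],
    "a": [0, 1, 0, 1],
    "c": [-2.0, 5.0, 8.0, -2.5],
})

down_columns = frame.rank()               # rank within each column
across_rows = frame.rank(axis="columns")  # rank within each row

print(down_columns)
print(across_rows)
```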


Slice Operator :

If we want to select a subset of rows from a DataFrame, we can do so by indicating
a range of rows separated by : inside the square brackets.
This is commonly known as a slice of rows.
The next instruction returns the slice of rows from the 9th to the 13th position.

Note that the slice uses row positions, not index labels, as references.
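
The original instruction is not available, so this is a sketch under assumptions: a made-up 20-row DataFrame with string labels, and the slice 8:13, which covers zero-based positions 8 to 12 (the 9th through 13th rows). The end position is excluded.

```python
import pandas as pd

labels = [f"r{i}" for i in range(20)]  # hypothetical row labels r0 ... r19
df = pd.DataFrame({"value": range(100, 120)}, index=labels)

subset = df[8:13]           # positions 8, 9, 10, 11, 12
same_rows = df.iloc[8:13]   # explicit positional form of the same slice

print(subset)
```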

If we want to select a subset of columns and rows using the labels as our references
instead of the positions, we can use loc indexing.
The next instruction returns all the rows between the index labels specified in the slice
before the comma, and the columns specified as a list after the comma.
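
A sketch with hypothetical labels and column names. Unlike a positional slice, a label slice with loc includes both end labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50], "c": [0, 0, 1, 1, 0]},
    index=["r1", "r2", "r3", "r4", "r5"],
)

# Rows labeled r2 through r4 (inclusive), columns a and b only
subset = df.loc["r2":"r4", ["a", "b"]]
print(subset)
```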


Pandas - Cleaning Data

Data Cleaning :
Data cleaning means fixing bad data in your data set. Bad data could
be:
Empty cells
Data in wrong format
Wrong data
Duplicates


isnull() method :

How many null values
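
A sketch with made-up data. isnull() marks each missing cell as True, and summing those marks counts the missing values per column.

```python
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, None, None],
})

null_counts = df.isnull().sum()   # missing values per column
total_nulls = null_counts.sum()   # missing values in the whole DataFrame

print(null_counts)
print(total_nulls)  # 3
```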


Remove Rows :

One way to deal with empty cells is to remove rows that contain empty cells.

dropna() method :

By default, the dropna() method returns a new DataFrame and does not change the original.
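
A sketch with made-up data. Any row holding at least one empty cell is removed from the returned copy, while the original keeps all its rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, None, None],
})

new_df = df.dropna()  # keeps only rows without missing values

print(new_df)
print(len(df), len(new_df))  # 3 1
```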


Replace Empty Values :

Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.

fillna() method :

The fillna() method allows us to replace empty cells with a value:
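
A sketch with made-up data; the replacement value 130 is an arbitrary choice for illustration. Every empty cell in every column receives the same value.

```python
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, None, None],
})

filled = df.fillna(130)  # returns a new DataFrame with no empty cells
print(filled)
```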


Replace Only For Specified Columns

Calling fillna() on the whole DataFrame replaces all empty cells in every column.
To replace empty values in only one column, select that column by name from
the DataFrame:
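
A sketch assuming a hypothetical "Calories" column and an arbitrary fill value of 130. Only the selected column is filled; empty cells elsewhere stay empty.

```python
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, None, None],
})

# Fill one column and assign the result back to it
df["Calories"] = df["Calories"].fillna(130)
print(df)
```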


Discovering Duplicates :

Duplicate rows are rows that have been registered more than one time.
By taking a look at our test data set, we can assume that rows 11 and 12
are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:
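
The test data set from the slides is not available, so this sketch uses a small made-up DataFrame. A row is marked True when it repeats an earlier row in full.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 109, 110, 120],
})

flags = df.duplicated()  # row 2 repeats row 0; row 3 differs in Pulse
print(flags)
print(flags.sum())  # number of duplicate rows: 1
```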
