Pandas
Pandas: Exploring Data using Series, Exploring Data using DataFrames, Index Objects,
Reindex, Drop Entry, Selecting Entries, Data Alignment, Rank and Sort
21CSS303T/DS
PANDAS
The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis".
Pandas allows us to analyze large data sets and draw conclusions based on statistical
theories.
Pandas can clean messy data sets, and make them readable and relevant.
Pandas can also delete rows that are not relevant or that contain wrong
values, such as empty or NULL values. This is called cleaning the data.
Pandas Codebase?
Pandas as pd
alias: In Python, an alias is an alternate name for referring to the same thing.
Create an alias with the "as" keyword while importing:
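As a minimal sketch of the import described here (the small Series is only an illustrative value, not from the slides):

```python
import pandas as pd  # "pd" is the conventional alias for the pandas module

# The alias points at the same module, so pd.Series is pandas.Series.
s = pd.Series([1, 2, 3])
print(pd.__version__)
```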
Pandas Series
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example : Create a simple Pandas Series from a list - int, float, string
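The original example did not survive extraction; a sketch of what it likely showed, with made-up values, could look like this:

```python
import pandas as pd

# One Series per element type; the values are illustrative only.
int_series = pd.Series([1, 7, 2])
float_series = pd.Series([1.5, 7.0, 2.25])
str_series = pd.Series(["a", "b", "c"])

print(int_series)
print(float_series)
print(str_series)
```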
Based on the values present in the series, the datatype of the series is decided.
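A short sketch of this type inference, using illustrative values. The dtype attribute reports the type pandas chose; the exact name printed for text data depends on the pandas version, so it is not stated here.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.5, 3.0])
mixed = pd.Series([1, 2.5])       # an int and a float are stored together as floats
strings = pd.Series(["a", "b"])   # dtype name for text varies across pandas versions

for s in (ints, floats, mixed, strings):
    print(s.dtype)
```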
Labels
If nothing else is specified, the values are labeled with their index number.
The first value has index 0, the second value has index 1, and so on.
This label can be used to access a specified value.
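A minimal sketch of label access. The custom labels "x", "y", "z" are an assumption added to show that the index argument can replace the default numbering.

```python
import pandas as pd

s = pd.Series([1, 7, 2])
first = s[0]            # default labels are 0, 1, 2

# Custom labels via the index argument (label names are illustrative).
labeled = pd.Series([1, 7, 2], index=["x", "y", "z"])
y_value = labeled["y"]

print(first, y_value)
```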
Pandas DataFrames
To create a Pandas DataFrame from a list of lists, pass the list of lists as the data
argument to "pandas.DataFrame()".
Each inner list inside the outer list becomes a row in the resulting DataFrame.
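A small sketch with placeholder values: three inner lists give three rows, and the columns get default labels 0, 1, 2.

```python
import pandas as pd

data = [["a1", "b1", "c1"],
        ["a2", "b2", "c2"],
        ["a3", "b3", "c3"]]

df = pd.DataFrame(data)  # each inner list becomes one row
print(df)
```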
Example : Create DataFrame from List of Lists with Column Names & Index
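The slide's own example is missing; one possible version, with hypothetical column names and row labels, is:

```python
import pandas as pd

data = [["Alice", 25], ["Bob", 30], ["Carol", 35]]

# columns names the columns, index labels the rows (all names are illustrative).
df = pd.DataFrame(data, columns=["name", "age"], index=["r1", "r2", "r3"])
print(df)
```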
Example : Create DataFrame from List of Lists with Different List Lengths
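A sketch of this case with illustrative values: when the inner lists have different lengths, pandas sizes the table to the longest list and fills the missing positions with NaN.

```python
import pandas as pd

data = [[1, 2, 3],
        [4, 5],
        [6]]

df = pd.DataFrame(data)  # shorter rows are padded with NaN
print(df)
```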
You can create a DataFrame from a dictionary by passing the dictionary as the data
argument to "pandas.DataFrame()".
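A minimal sketch, assuming a dictionary of equal-length lists (the keys and values are made up): each key becomes a column name and each list becomes that column's values.

```python
import pandas as pd

data = {"calories": [420, 380, 390],
        "duration": [50, 40, 45]}

df = pd.DataFrame(data)  # keys -> column names, lists -> column values
print(df)
```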
A simple way to store big data sets is to use CSV files (comma-separated values files).
CSV files contain plain text, and CSV is a well-known format that can be read by
everyone.
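Reading a CSV file is usually done with pd.read_csv. To keep this sketch self-contained, the CSV text is held in memory with io.StringIO instead of a file on disk; with a real file you would pass its path instead. The column names and rows are illustrative.

```python
import io
import pandas as pd

csv_text = "name,age\nAlice,25\nBob,30\n"

# read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```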
to_string() Method :
Null Values :
shape Attribute :
Viewing Data :
To see how the data looks, we can use the head() method, which shows just
the first five rows by default. If we pass a number as an argument to this method,
that number of rows is listed instead.
df.head() Method :
tail() Method :
The tail() method returns the last five rows by default.
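A brief sketch of head() and tail() on a ten-row DataFrame built only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"n": list(range(10))})

first_five = df.head()     # default: first 5 rows
first_three = df.head(3)   # explicit row count
last_five = df.tail()      # default: last 5 rows

print(first_three)
print(last_five)
```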
If we want to know the names of the columns or the names of the indexes,
we can use the DataFrame attributes columns and index respectively.
The names of the columns or indexes can be changed by assigning a new list
of the same length to these attributes.
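A minimal sketch of reading and reassigning these attributes; the old and new names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(df.columns)
print(df.index)

# Assign lists of the same length to rename columns and row labels.
df.columns = ["x", "y"]
df.index = ["row1", "row2"]
print(df)
```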
The values of any DataFrame can be retrieved as a NumPy array by accessing its
values attribute.
The DataFrame object has a method called info() that gives you more
information about the data set.
describe() Method :
If we just want quick statistical information on all the numeric columns in a
DataFrame, we can use the describe() method.
The result shows the count, the mean, the standard deviation, the minimum and
maximum, and the percentiles (by default the 25th, 50th, and 75th) for the values
in each column or Series.
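A short sketch of describe() on a hypothetical DataFrame with one numeric and one text column; only the numeric column appears in the default summary.

```python
import pandas as pd

df = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0],
                   "name": ["a", "b", "c", "d"]})

summary = df.describe()  # numeric columns only, by default
print(summary)
```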
Selecting Data
Reindexing
Calling reindex on a Series rearranges the data according to the new index,
introducing missing values if any index values were not already present:
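The Series from the original slide is not preserved, so this sketch uses an illustrative one. The label "e" does not exist in the original index and therefore gets a missing value (NaN).

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Rearranged to the new order; "e" is new, so its value is NaN.
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print(obj2)
```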
For ordered data like time series, it may be desirable to do some interpolation or
filling of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:
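A sketch of forward filling during reindexing, with illustrative values: labels 1, 3, and 5 are missing from the original index, so each takes the last value seen before it.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# method="ffill" carries the previous valid value forward into the gaps.
filled = obj3.reindex(list(range(6)), method="ffill")
print(filled)
```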
Dropping one or more entries from an axis is easy if you already have an index
array or list without those entries.
The drop method returns a new object with the indicated value or values deleted
from an axis:
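A minimal sketch of drop on a Series with illustrative labels. Note that the original object is left unchanged.

```python
import pandas as pd

obj = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d", "e"])

without_c = obj.drop("c")          # drop a single label
without_cd = obj.drop(["d", "c"])  # drop several labels at once

print(without_c)
print(without_cd)
```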
Ranking assigns ranks from one through the number of valid data points in an array.
The rank methods for Series and DataFrame are the place to look; by default rank
breaks ties by assigning each group the mean rank:
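A sketch of the default tie handling. The values are an assumption, chosen so that the two 7s tie for ranks 6 and 7 and the two 4s tie for ranks 4 and 5.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default method="average": tied values share the mean of their ranks.
default_ranks = obj.rank()
print(default_ranks)
```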
Ranks can also be assigned according to the order in which they’re observed in
the data:
Here, instead of using the average rank 6.5 for the entries 0 and 2, they
have been set to 6 and 7 because label 0 precedes label 2 in the data.
You can rank in descending order, too:
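The slide's data is not preserved; the Series below is an assumption chosen so that labels 0 and 2 tie with an average rank of 6.5, matching the description. It sketches both order-of-appearance ranking and descending ranking.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# method="first": ties are broken by order of appearance in the data.
first_ranks = obj.rank(method="first")

# ascending=False: the largest value receives rank 1.
descending_ranks = obj.rank(ascending=False)

print(first_ranks)
print(descending_ranks)
```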
Slice Operator :
Note : the slice does not use the index labels as references, but the positions.
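A small sketch of positional slicing on a DataFrame with string row labels (the labels and values are illustrative). The slice 1:3 selects the rows at positions 1 and 2, regardless of what their labels are.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

subset = df[1:3]  # positions 1 and 2; the end position 3 is excluded
print(subset)
```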
If we want to select a subset of columns and rows using the labels as our references
instead of the positions, we can use loc indexing:
A loc expression returns all the rows between the index labels given in the slice
before the comma (both endpoints included), and the columns given as a list after the comma.
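A minimal sketch of loc with a label slice and a column list; the row labels and column names are illustrative. Unlike positional slicing, the label slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

# Rows labeled "x" through "y" (inclusive), columns "a" and "c".
subset = df.loc["x":"y", ["a", "c"]]
print(subset)
```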
Data Cleaning :
Data cleaning means fixing bad data in your data set. Bad data could
be:
Empty cells
Data in wrong format
Wrong data
Duplicates
isnull() method :
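A sketch of isnull() on a hypothetical DataFrame with one empty cell per column: it returns a DataFrame of booleans of the same shape, and summing it counts the missing values in each column.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

mask = df.isnull()               # True where a cell is empty
missing_per_column = mask.sum()  # True counts as 1

print(mask)
print(missing_per_column)
```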
Remove Rows :
One way to deal with empty cells is to remove rows that contain empty cells.
dropna() method :
By default, the dropna() method returns a new DataFrame and does not change the original.
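A minimal sketch with illustrative data: only the first row has no empty cells, so it is the only row kept, while the original DataFrame still has all three rows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})

cleaned = df.dropna()  # new DataFrame without rows that contain empty cells

print(cleaned)
print(len(df))
```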
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
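The slide does not name the method; in pandas this is typically done with fillna(). A sketch with illustrative data and a replacement value of 0:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [None, 4.0]})

filled = df.fillna(0)  # every empty cell in every column becomes 0
print(filled)
```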
The example above replaces all empty cells in the whole DataFrame.
To only replace empty values for one column, specify the column name for
the DataFrame:
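One way to sketch the single-column case, assuming fillna() as the replacement method and illustrative column names: select the column, fill it, and assign it back. The other column keeps its empty cell.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [None, 4.0]})

# Fill only column "a"; column "b" is left untouched.
df["a"] = df["a"].fillna(0)
print(df)
```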
Discovering Duplicates :
Duplicate rows are rows that have been registered more than one time.
By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:
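The test data set referred to in the slides is not included, so this sketch uses a small hypothetical DataFrame in which the third row repeats the first. By default, duplicated() marks the later occurrence as True.

```python
import pandas as pd

df = pd.DataFrame({"duration": [60, 45, 60, 60],
                   "pulse": [110, 100, 110, 120]})

dupes = df.duplicated()  # True for rows that repeat an earlier row
print(dupes)
print(int(dupes.sum()))
```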