Pandas
Introduction
The pandas library for Python is very useful for manipulating numerical and tabular data and
is widely used in machine learning for data analysis.
Why Pandas
• Intrinsic data alignment
• Data operation functions
• Functions for handling missing data
• Data standardization functions
• Data structures that handle the major use cases
Pandas Features
• Powerful data structures
• Fast and efficient data wrangling
• Easy data aggregation and transformation
• Tools for reading/writing data
• Intelligent and automated data alignment
• High-performance merging and joining of data sets
2. Technical Setup
Series:
A Series is a one-dimensional array of indexed data. Unlike a DataFrame, a Series has no
column names; it has only one overall name. Create one with the Series() function.
• One-dimensional labeled array.
• Support multiple data types
Syntax:
S = pd.Series(data, index = [index])
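The syntax above can be sketched with a small example. The city names, values, and the name "population" are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one value per label, plus a single overall name.
s = pd.Series(
    [1.4, 0.5, 0.3],
    index=["Kathmandu", "Pokhara", "Lalitpur"],
    name="population",
)

print(s["Pokhara"])  # access by index label
print(s.iloc[0])     # access by integer position
print(s.name)        # the one overall name of the Series
```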
DataFrames:
A DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with
rows and columns. Create one with the DataFrame() function.
• Two-dimensional labeled array.
• Support multiple data types
• Input can be a Series
• Input can be another DataFrame.
type(df) : pandas.core.frame.DataFrame (checks that df is a DataFrame object)
Index in DataFrame: The list of row labels used in a DataFrame is known as an Index.
#df = pd.DataFrame({'Nepal': ['nepal is', 'beautiful', 'country.'],
                    'Kathmandu': ['Kathmandu is', 'capital', 'of nepal']},
                   index=['A', 'B', 'c'])
df = pd.read_csv("file location")
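As a rough sketch of the other input types mentioned for DataFrames, the example below builds a DataFrame from a Series and then from another DataFrame. The values and the name "score" are illustrative assumptions.

```python
import pandas as pd

# Hypothetical Series used as input; its name becomes the column name.
s = pd.Series([10, 20, 30], index=["A", "B", "C"], name="score")

df_from_series = pd.DataFrame(s)        # DataFrame built from a Series
df_copy = pd.DataFrame(df_from_series)  # DataFrame built from another DataFrame

print(type(df_from_series))          # the DataFrame class
print(list(df_from_series.index))    # row labels (the Index)
print(list(df_from_series.columns))  # column labels
```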
Data Input
Function              Description
read_csv()            Read a CSV file
read_json()           Read a JSON file
read_html()           Read tables from an HTML document
read_xml()            Read an XML file
read_sql()            Read a SQL query or database table
read_excel()          Read an Excel file
to_csv("file name")   Save the DataFrame in CSV file format
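A minimal round trip with to_csv() and read_csv() might look like the following. To keep the sketch self-contained it writes to an in-memory buffer instead of a file on disk; the column names and values are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"city": ["Kathmandu", "Pokhara"], "rank": [1, 2]})

# Write the DataFrame as CSV text into an in-memory buffer (stands in for a file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves the row labels out of the CSV

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
```

With a real file, the buffer would be replaced by a path string in both calls.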
Shape:
The shape attribute returns a tuple representing the dimensionality of the DataFrame: the
number of rows and the number of columns.
#DataFrame.shape
E.g. df.shape
Out: (rows, columns)
#df.shape[0]
Out: number of rows
#df.shape[1]
Out: number of columns
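A short sketch of shape on a small, made-up DataFrame:

```python
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape  # shape is a plain tuple, so it can be unpacked
print(df.shape)
print(n_rows, n_cols)
```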
head() and tail():
#DataFrame.head(n)
Returns the first n rows of the DataFrame.
Note: if you do not pass a number, the first five rows are returned.
#DataFrame.tail(n)
Returns the last n rows of the DataFrame.
Note: if you do not pass a number, the last five rows are returned.
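The defaults described above can be checked on a made-up ten-row DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with ten rows numbered 0..9.
df = pd.DataFrame({"n": range(10)})

first_five = df.head()    # no argument: first five rows
first_three = df.head(3)  # first three rows
last_two = df.tail(2)     # last two rows

print(first_three)
print(last_two)
```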
info():
info() prints a summary of the DataFrame, including the number of entries, the data type of
each column, and the number of non-null entries in each column.
#DataFrame.info()
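As an illustration, the sketch below calls info() on a made-up DataFrame with one missing value. info() normally prints straight to the console; here the buf parameter redirects the summary into a string so it can also be inspected in code.

```python
import io

import pandas as pd

# Hypothetical data with one missing score.
df = pd.DataFrame({"city": ["A", "B", "C"], "score": [1.0, None, 3.0]})

df.info()  # prints the summary to standard output

# Capture the same summary as text instead of printing it.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
```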
5. Basic Analysis
value_counts ():
The value_counts() method is very useful in pandas. It returns a Series containing the count
of each unique value.
By default, the results are in descending order, so the first element is the most frequently
occurring one.
#Series.value_counts(normalize=False, sort=True, ascending=False, bins=None,
dropna=True)
→ Use the parameters above as needed.
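A small sketch of value_counts() and two of its parameters, using made-up values that include one missing entry:

```python
import pandas as pd

# Hypothetical categorical data with one missing value.
s = pd.Series(["a", "b", "a", "c", "a", "b", None])

counts = s.value_counts()                    # most frequent value first; missing values dropped
shares = s.value_counts(normalize=True)      # relative frequencies instead of counts
with_missing = s.value_counts(dropna=False)  # also counts the missing value

print(counts)
print(shares)
print(with_missing)
```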
sort_values():
#Series.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
➔ Sort a Series by its values.
#DataFrame.sort_values(by, axis=0, ascending=True, inplace=False,
kind='quicksort', na_position='last')
➔ Sort by the values along either axis.
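The effect of by, ascending, and na_position can be sketched as follows. The column names and values are illustrative.

```python
import pandas as pd

# Hypothetical scores, one of them missing.
df = pd.DataFrame({"name": ["x", "y", "z"], "score": [2.0, None, 1.0]})

# Ascending sort; missing values go last by default.
by_score = df.sort_values(by="score")

# Descending sort with missing values placed first.
desc = df.sort_values(by="score", ascending=False, na_position="first")

# A Series sorts by its own values, so no "by" argument is needed.
sorted_scores = df["score"].sort_values()

print(by_score)
print(desc)
```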
Boolean Indexing:
➔ Boolean vectors can be used to filter data.
Operator    Symbol
AND         &
OR          |
NOT         ~
EQUAL-TO    ==
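The operators in the table can be combined to build filters, as in this sketch. The data is made up, and each comparison is wrapped in parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical city data.
df = pd.DataFrame({
    "city": ["Kathmandu", "Pokhara", "Lalitpur"],
    "population": [1.4, 0.5, 0.3],
    "capital": [True, False, False],
})

# AND combined with NOT: larger cities that are not the capital.
large_non_capital = df[(df["population"] > 0.4) & ~df["capital"]]

# OR combined with EQUAL-TO: rows that are Lalitpur or the capital.
either = df[(df["city"] == "Lalitpur") | df["capital"]]

print(large_non_capital)
print(either)
```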
loc[]:
➔ DataFrame.loc[] / Series.loc[]
➔ Selects rows and columns by label.
➔ loc[] raises a KeyError when the requested labels are not found.
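A brief sketch of label-based selection with loc[], including the KeyError case. The labels and values are made up.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["A", "B", "C"])

row_a = df.loc["A"]               # one row, returned as a Series
subset = df.loc[["A", "C"], "x"]  # selected rows of one column
cell = df.loc["B", "y"]           # a single value

# A label that does not exist raises KeyError.
try:
    df.loc["Z"]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(row_a)
print(subset)
print(cell, missing_label_raised)
```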
iloc[]:
➔ DataFrame.iloc[]
➔ Selects rows and columns by integer position.
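For comparison with label-based access, here is a sketch of position-based selection with iloc[] on a made-up DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame; iloc ignores the labels and uses positions only.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["A", "B", "C"])

first_row = df.iloc[0]           # first row, returned as a Series
first_two_x = df.iloc[0:2, 0]    # rows 0 and 1 of the first column (end is exclusive)
last_y = df.iloc[-1, 1]          # last row, second column

print(first_row)
print(first_two_x)
print(last_y)
```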
6. GroupBy
7. Reshaping
stack():
Pivot a level of the column labels, returning a DataFrame or Series, with a new innermost level of
row labels.
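A minimal sketch of stack() on a made-up DataFrame with a single level of column labels. In that case the result is a Series whose index has two levels: the original row labels and the former column labels.

```python
import pandas as pd

# Hypothetical marks table: rows are students, columns are subjects.
df = pd.DataFrame({"math": [90, 80], "art": [70, 60]}, index=["A", "B"])

stacked = df.stack()  # column labels become the innermost level of the row index

print(stacked)
print(stacked.index.nlevels)  # number of levels in the resulting row index
```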
unstack():
#DataFrame.unstack(level=-1, fill_value=None)
➔ Pivot a level of the index labels, returning a DataFrame having a new level of column
labels.
➔ If the index is not a MultiIndex, the output will be a Series-the level involved will