Introduction To Pandas
• Pandas is an open-source, BSD-licensed library providing
high-performance, easy-to-use data structures and data
analysis tools for the Python programming language
Why
• Suppose we have been given an input file with employee
details like emp_name, emp_salary, emp_department, ..
• We need to find the sum of employees’ salaries for each
department (or sum of salaries department wise)
• Solution 1: Only Python code
• Solution 2: Using Pandas
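The two solutions can be sketched side by side as follows. This is a minimal illustration, not the original slide code: the sample rows are hypothetical, and an in-memory string stands in for the input file.

```python
import csv
import io
from collections import defaultdict

import pandas as pd

# Hypothetical sample data standing in for the employee input file.
CSV_TEXT = """emp_name,emp_salary,emp_department
Asha,5000,HR
Ben,7000,IT
Chen,6500,IT
Dana,4800,HR
"""

# Solution 1: plain Python, keeping a running total per department.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    totals[row["emp_department"]] += int(row["emp_salary"])

# Solution 2: pandas, grouping by department and summing the salary column.
df = pd.read_csv(io.StringIO(CSV_TEXT))
salary_by_dept = df.groupby("emp_department")["emp_salary"].sum()

print(dict(totals))
print(salary_by_dept)
```

The pandas version replaces the explicit loop with a single groupby call.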
Pandas data structure
• Series:
• A pandas Series is a one-dimensional array with index labels. In more technical terms, these labels are referred to as the axis index. In other words, a Series is a one-dimensional collection of objects with an axis index
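A small sketch of a Series and its axis index. The values and labels here are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical values and labels.
salaries = pd.Series([5000, 7000, 6500], index=["Asha", "Ben", "Chen"])

print(salaries.index)    # the axis index (labels)
print(salaries["Ben"])   # look up a value by its label
print(salaries.iloc[0])  # look up a value by its position
```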
• DataFrame
• A pandas DataFrame is a two-dimensional labeled data structure: one set of labels is the axis index (row labels), and the second is the axis columns (column labels). We can think of it as a table in Excel or another spreadsheet, where data is organized in rows and columns
• The pandas library provides a constructor to create a DataFrame:
pandas.DataFrame(data, index=index, columns=[column(s)])
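As a minimal sketch of the constructor call above, assuming the conventional alias import pandas as pd. The data, row labels, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical rows, row labels, and column labels.
data = [["Asha", 5000, "HR"], ["Ben", 7000, "IT"]]
df = pd.DataFrame(
    data,
    index=["e1", "e2"],
    columns=["emp_name", "emp_salary", "emp_department"],
)

print(df)
```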
Loading data from files
• Loading the data from CSV into a DataFrame:
• CSV is the abbreviation of Comma-Separated Values, so a CSV file typically contains values separated by commas
• Loading data from a CSV file with a header into a DataFrame
• The read_csv() function automatically picks up the first line as the header and assigns the column names in the DataFrame accordingly. Following is the snapshot of a CSV with a header.
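A minimal sketch of loading a CSV that has a header row. To keep the example self-contained, a hypothetical in-memory string stands in for the file; in practice a file path would be passed to read_csv().

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line is the header.
csv_with_header = io.StringIO(
    "emp_name,emp_salary,emp_department\n"
    "Asha,5000,HR\n"
    "Ben,7000,IT\n"
)

df = pd.read_csv(csv_with_header)
print(df.columns.tolist())
print(df)
```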
• Loading data from a CSV file without a header into a DataFrame:
• If the CSV file doesn’t have a header, we have to pass the column names as an argument when calling the read_csv() function to load the data from the CSV into a DataFrame. The call looks like pandas.read_csv(<Input_CSV_file_Path>, names=[col1,col2…coln]). If the file has no header and nothing is passed to the names keyword, then by default pandas will use the first row’s value(s) as the column name(s) of the DataFrame
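The difference between passing names and relying on the default can be sketched as follows. The rows and column names are hypothetical, and an in-memory string stands in for the file.

```python
import io

import pandas as pd

# Hypothetical CSV content with no header line.
raw = "Asha,5000,HR\nBen,7000,IT\n"

# Passing names tells pandas that every line is data.
named = pd.read_csv(
    io.StringIO(raw),
    names=["emp_name", "emp_salary", "emp_department"],
)

# Without names, the first data row is consumed as the header.
default = pd.read_csv(io.StringIO(raw))

print(named)
print(default)
```

Note that the default call silently loses one row of data, which is easy to miss on large files.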
• Loading the data from an Excel file into a DataFrame
• To load data from a Microsoft Excel file into a DataFrame, we have the
read_excel() function:
pd.read_excel(<excel_file_path>,sheet_name=<excel_sheet_name>)
• Loading the data from a JSON file into a DataFrame:
• JSON is a file format commonly used across systems and platforms. It organizes data as key-value pairs and ordered lists
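A minimal sketch of loading JSON with read_json(). The records are hypothetical, and an in-memory string stands in for the file; a file path could be passed instead.

```python
import io

import pandas as pd

# Hypothetical JSON: an ordered list of key-value records.
json_text = (
    '[{"emp_name": "Asha", "emp_salary": 5000},'
    ' {"emp_name": "Ben", "emp_salary": 7000}]'
)

df = pd.read_json(io.StringIO(json_text))
print(df)
```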
DataFrame operations
DataFrame Information
• Using the info() method.
• This method prints a summary of the DataFrame: the number of rows and columns, the data type and non-null count of each column, and the memory usage
• df.shape
• returns a tuple representing the number
of rows and columns
• df.columns
• returns an Index object containing the column labels
• df.index
• returns an Index object containing the index labels
• df.describe()
• returns summary statistics for each numerical column: count, mean, standard deviation, minimum, quartiles, and maximum
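The inspection attributes and methods above can be tried on a small DataFrame. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Asha", "Ben", "Chen"], "age": [31, 24, 45]})

df.info()                # prints the summary; the return value is None
print(df.shape)          # (rows, columns)
print(df.columns)        # column labels
print(df.index)          # row labels
summary = df.describe()  # statistics for the numeric columns only
print(summary)
```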
Indexing and selection
• .loc: This attribute is used to access a group of rows and columns by labels. It is primarily label based, but may also be used with a boolean array. Label slices include both endpoints.
• # Selecting rows by label
• df.loc[1:3, ['name', 'age']]
• .iloc: This attribute is used to access a group of rows and columns by integer position. It is primarily position based, but may also be used with a boolean array. Position slices exclude the end position.
• # Selecting rows by position
• df.iloc[1:3, [0, 1]]
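The two selections look alike but return different rows, because a label slice with .loc includes its end label while a position slice with .iloc excludes its end position. A minimal sketch on hypothetical data with the default integer index:

```python
import pandas as pd

# Hypothetical sample data; the default index labels are 0..4.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana", "Eli"],
        "age": [31, 24, 45, 29, 38],
    }
)

by_label = df.loc[1:3, ["name", "age"]]  # labels 1, 2 and 3
by_position = df.iloc[1:3, [0, 1]]       # positions 1 and 2 only

print(by_label)
print(by_position)
```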
• .at: This accessor is used to access a single value in the DataFrame by its row and column labels. It is faster than .loc for accessing a single value.
• # Selecting a single value by label
• df.at[1, 'name']
• .iat: This accessor is used to access a single value in the DataFrame by its integer row and column positions. It is faster than .iloc for accessing a single value.
• # Selecting a single value by position
• df.iat[1, 0]
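A short sketch of scalar access, using hypothetical data. Both accessors can also be used to set a single value.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Asha", "Ben", "Chen"], "age": [31, 24, 45]})

value_by_label = df.at[1, "name"]  # row label 1, column label "name"
value_by_position = df.iat[1, 0]   # second row, first column

# Assigning through .at updates one cell in place.
df.at[1, "age"] = 25

print(value_by_label, value_by_position, df.at[1, "age"])
```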
• .ix: This attribute was used to access a group of rows and columns by either labels or integer positions. It was deprecated and then removed in pandas 1.0; use .loc and .iloc instead.
• Boolean Indexing: This method is used to filter a DataFrame
based on a boolean condition. It returns a new DataFrame
containing only the rows that meet the specified condition.
• # Filtering DataFrame based on condition
• df[df['age'] > 25]
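A minimal sketch of Boolean indexing on hypothetical data. The comparison produces a boolean Series (the mask), and indexing with it keeps only the rows where the mask is True.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Asha", "Ben", "Chen"], "age": [31, 24, 45]})

mask = df["age"] > 25  # boolean Series, one entry per row
older = df[mask]       # new DataFrame; df itself is unchanged
younger = df[~mask]    # ~ inverts the mask

print(older)
print(younger)
```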
• .query(): This method is used to filter a DataFrame based on a query expression written as a string. It is similar to Boolean indexing, but combining several conditions is often easier to read.
• # Filtering DataFrame based on query
• df.query('age > 25 and country == "UK"')
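To show how the query relates to Boolean indexing, the sketch below runs both forms on hypothetical data and checks that they select the same rows.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "age": [31, 24, 45, 29],
        "country": ["UK", "UK", "USA", "UK"],
    }
)

# Query expression: conditions are combined with "and" inside one string.
result = df.query('age > 25 and country == "UK"')

# Equivalent Boolean indexing: each condition needs parentheses and "&".
equivalent = df[(df["age"] > 25) & (df["country"] == "UK")]

print(result)
```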
Data cleaning and transformation
• .drop(): This method is used to remove rows or columns from a
DataFrame. You can specify the axis (0 for rows, 1 for columns) and
the labels or indexes of the rows or columns to be removed.
• # Dropping a column
• df.drop('age',axis=1)
• .fillna(): This method is used to fill missing values in a DataFrame with a specified value. To fill missing values with the previous or next value, use the .ffill() or .bfill() methods, respectively (older code passes method='ffill' or method='bfill' to fillna(), which is deprecated in recent pandas releases).
• # Filling missing values with 0
• df.fillna(0)
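A minimal sketch of dropping a column and filling missing values, on hypothetical data. Both drop() and fillna() return a new DataFrame and leave the original unchanged unless the result is assigned back.

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in the numeric columns.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [31.0, None, 45.0],
        "score": [None, 7.5, 8.0],
    }
)

without_age = df.drop("age", axis=1)  # remove one column
zero_filled = df.fillna(0)            # replace every missing value with 0
forward_filled = df.ffill()           # carry the previous value forward

print(without_age)
print(zero_filled)
print(forward_filled)
```

Forward filling cannot fill a missing value in the first row, because there is no previous value to carry forward.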
• .replace(): This method is used to replace specific values in a
DataFrame with a different value. You can specify the values to
be replaced and the replacement value.
• # Replacing specific values
• df.replace({'USA': 'United States', 'UK': 'United Kingdom'})
• .rename(): This method is used to rename columns or indexes
in a DataFrame. You can specify a dictionary of old and new
names or a function to determine the new names.
• # Renaming columns
• df.rename(columns={'name': 'full_name'})
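A short sketch of replace() and rename() on hypothetical data, including the function form of rename(). Each call returns a new DataFrame.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Asha", "Ben"], "country": ["UK", "USA"]})

# Replace values using an old-to-new mapping.
replaced = df.replace({"USA": "United States", "UK": "United Kingdom"})

# Rename columns with a dictionary, or with a function applied to each name.
renamed = df.rename(columns={"name": "full_name"})
upper = df.rename(columns=str.upper)

print(replaced)
print(renamed.columns.tolist(), upper.columns.tolist())
```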
• .map(): This method is used to apply a function to each element
in a column or series. You can specify a function that takes one
input and returns one output.
• # Applying function to column
• df['age'] = df['age'].map(lambda x: x*2)
• df.head()
• .apply(): This method is used to apply a function along an axis of a DataFrame, that is, to each column (axis=0, the default) or to each row (axis=1). The function receives each column or row as a Series and returns one output.
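The difference between element-wise map() and axis-wise apply() can be sketched on hypothetical data as follows. The bonus column is an assumption added only so that a row-wise function has two values to combine.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Asha", "Ben"], "age": [31, 24], "bonus": [2, 3]})

# map(): the function sees one element at a time.
df["age"] = df["age"].map(lambda x: x * 2)

numeric = df[["age", "bonus"]]

# apply() with the default axis=0: the function sees one column at a time.
col_max = numeric.apply(lambda col: col.max())

# apply() with axis=1: the function sees one row at a time.
row_total = numeric.apply(lambda row: row["age"] + row["bonus"], axis=1)

print(col_max)
print(row_total)
```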
Data explorations
• value_counts(): Called on a column (a Series), this method returns the frequency count of each unique value, with the most frequent value first.
• # Viewing the frequency counts for a column
• df['column_name'].value_counts()
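A minimal sketch of value_counts() on a hypothetical column, including the normalize option that returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"country": ["UK", "USA", "UK", "India", "UK"]})

counts = df["country"].value_counts()                # absolute frequencies
shares = df["country"].value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```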
• df.plot(): This function is used to create a variety of plots,
including line, bar, and histogram plots, for the DataFrame. You
can specify the type of plot, the x and y columns, and various
other plot options.
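• df.corr(): This function is used to compute the pairwise correlation of columns in a DataFrame (Pearson by default). In pandas 2.0 and later, pass numeric_only=True if the DataFrame also contains non-numeric columns.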
• # Viewing correlation between columns
• df.corr()
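A short sketch of corr() on hypothetical data. The salary values are constructed as an exact linear function of age, so the two columns are perfectly correlated. The numeric_only=True argument is an assumption about the pandas version in use (it is available from pandas 1.5) and skips the text column.

```python
import pandas as pd

# Hypothetical sample data; salary is an exact linear function of age.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "age": [25, 30, 35, 40],
        "salary": [3000, 4000, 5000, 6000],
    }
)

# Restrict the calculation to numeric columns so "name" is ignored.
corr = df.corr(numeric_only=True)
print(corr)
```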
Merging and joining data