Python Unit IV - Pandas
Lecture Notes
UNIT IV - DATA MANIPULATION WITH PANDAS
Prepared By:
Dr. M. Shanmugam
Mr. D. Rajesh
Mrs. M. Hemalatha
Mrs. S. Deeba
Ms. Amala Margret
PART- A (2 Marks)
1. What is pandas in Python? (K1)
Pandas is an open-source Python library with powerful and built-in methods to efficiently clean,
analyze, and manipulate datasets. Developed by Wes McKinney in 2008, this powerful package
can easily blend with various other data science modules in Python.
Pandas is built on top of the NumPy library, i.e., its data structures Series and DataFrame are
the upgraded versions of NumPy arrays.
2. How do you access the top 6 rows and last 7 rows of a pandas DataFrame?
The head() method in pandas is used to access the initial rows of a DataFrame, and tail() method
is used to access the last rows.
To access the top 6 rows: dataframe_name.head(6)
To access the last 7 rows: dataframe_name.tail(7)
3. Why doesn't DataFrame.shape have parentheses?
In pandas, shape is an attribute and not a method. So, you should access it without parentheses.
DataFrame.shape outputs a tuple with the number of rows and columns in a DataFrame.
4. What is the difference between Series and DataFrame?
DataFrame: The pandas DataFrame will be in tabular format with multiple rows and columns
where each column can be of different data types.
Series: The Series is a one-dimensional labeled array that can store any data type, but all of its
values should be of the same data type. The Series data structure is more like a single column of
a DataFrame.
The Series data structure consumes less memory than a DataFrame. So, certain data
manipulation tasks are faster on it. However, DataFrame can store large and complex datasets,
while Series can handle only homogeneous data. So, the set of operations you can perform on
DataFrame is significantly higher than on Series data structure.
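As a quick illustration (a minimal sketch with invented values), a Series holds a single typed column, while a DataFrame holds several:
import pandas as pd
# A Series: one-dimensional, one data type
marks = pd.Series([70, 80, 95], name='Marks')
# A DataFrame: two-dimensional; each column can have its own data type
df = pd.DataFrame({'Name': ['John', 'Cataline', 'Matt'],
                   'Marks': [70, 80, 95]})
print(marks.dtype)   # int64
print(df.dtypes)     # Name: object, Marks: int64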
5. What is an index in pandas?
The index is a series of labels that can uniquely identify each row of a DataFrame. The index
can be of any hashable datatype, such as an integer or a string.
df.index prints the current row indexes of the DataFrame df.
6. What is Multi indexing in pandas?
Index in pandas uniquely specifies each row of a DataFrame. We usually pick the column that
can uniquely identify each row of a DataFrame and set it as the index. But what if you don’t
have a single column that can do this?
For example, you have the columns “name”, “age”, “address”, and “marks” in a DataFrame.
None of these columns on its own is guaranteed to have unique values for all the different rows, so each is unfit as an index by itself.
However, the columns “name” and “address” together may uniquely identify each row of the
DataFrame. So you can set both columns as the index. Your DataFrame now has a multi-index
or hierarchical index.
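For instance (a small sketch with invented records), the two columns can be set as a hierarchical index like this:
import pandas as pd
data = {'name': ['John', 'John', 'Matt'],
        'address': ['Austin', 'Boston', 'Austin'],
        'marks': [70, 75, 95]}
df = pd.DataFrame(data)
# Set a multi-index (hierarchical index) on 'name' and 'address'
df = df.set_index(['name', 'address'])
print(df.index)   # MultiIndex of (name, address) pairs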
7. Explain pandas Reindexing
Reindexing in pandas lets us create a new DataFrame object from the existing DataFrame with
the updated row indexes and column labels.
You can provide a set of new indexes to the function DataFrame.reindex() and it will create a
new DataFrame object with given indexes and take values from the actual DataFrame.
If values for these new indexes were not present in the original DataFrame, the function fills
those positions with the default nulls. However, we can alter the default value NaN to whatever
value we want them to fill with.
Here is the sample code:
Create a DataFrame df with indexes:
import pandas as pd
data = [['John', 50, 'Austin', 70],
['Cataline', 45 , 'San Francisco', 80],
['Matt', 30, 'Boston' , 95]]
columns = ['Name', 'Age', 'City', 'Marks']
#row indexes
idx = ['x', 'y', 'z']
df = pd.DataFrame(data, columns=columns, index=idx)
print(df)
Reindex with new set of indexes:
new_idx = ['a', 'y', 'z']
new_df = df.reindex(new_idx)
print(new_df)
The new_df has values from the df for common indexes ( ‘y’ and ‘z’), and the new index ‘a’ is
filled with the default NaN.
8. What is the difference between loc and iloc?
Both the loc and iloc methods in pandas are used to select subsets of a DataFrame. Practically,
these are widely used for filtering a DataFrame based on conditions.
We should use the loc method to select data using actual labels of rows and columns, while the
iloc method is used to extract data based on integer indices of rows and columns.
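A short sketch (sample data invented) showing both methods side by side:
import pandas as pd
df = pd.DataFrame({'age': [28, 24, 35]}, index=['John', 'Anna', 'Peter'])
print(df.loc['Anna', 'age'])   # label-based: row 'Anna', column 'age' -> 24
print(df.iloc[1, 0])           # position-based: second row, first column -> 24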
9. Show two different ways to create a pandas DataFrame
Using Python Dictionary:
import pandas as pd
data = {'Name': ['John', 'Cataline', 'Matt'],
'Age': [50, 45, 30],
'City': ['Austin', 'San Francisco', 'Boston'],
'Marks' : [70, 80, 95]}
df = pd.DataFrame(data)
Using Python Lists:
import pandas as pd
data = [['John', 25, 'Austin',70],
['Cataline', 30, 'San Francisco',80],
['Matt', 35, 'Boston',90]]
columns = ['Name', 'Age', 'City', 'Marks']
df = pd.DataFrame(data, columns=columns)
10. How do you get the count of all unique values of a categorical column in a DataFrame?
The function Series.value_counts() returns the count of each unique value of a series or a
column.
Example:
We have created a DataFrame df that contains a categorical column named 'Sex', and run the value_counts() function to see the count of each unique value in that column.
import pandas as pd
data = [['John', 50, 'Male', 'Austin', 70],
['Cataline', 45 ,'Female', 'San Francisco', 80],
['Matt', 30 ,'Male','Boston', 95]]
# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']
# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)
df['Sex'].value_counts()
11. How do you optimize the performance while working with large datasets in pandas?
Load less data: While reading data using pd.read_csv(), choose only the columns you need
with the “usecols” parameter to avoid loading unnecessary data. Plus, specifying the
“chunksize” parameter splits the data into different chunks and processes them sequentially.
Avoid loops: Loops and iterations are expensive, especially when working with large datasets.
Instead, opt for vectorized operations, as they are applied on an entire column at once, making
them faster than row-wise iterations.
Use data aggregation: Try aggregating data and perform statistical operations because
operations on aggregated data are more efficient than on the entire dataset.
Use the right data types: The default data types in pandas are not memory efficient. For
example, integer values take the default datatype of int64, but if your values can fit in int32,
adjusting the datatype to int32 can optimize the memory usage.
Parallel processing: Dask provides a pandas-like API for working with large datasets. It utilizes multiple processes of your system to execute different data tasks in parallel.
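As a quick illustration of the loading tips above (file and column names are placeholders):
import pandas as pd
# Load only the columns you need, with a smaller integer dtype
df = pd.read_csv('large_data.csv', usecols=['id', 'marks'], dtype={'marks': 'int32'})
# Or process the file in chunks instead of loading it all at once
for chunk in pd.read_csv('large_data.csv', chunksize=100000):
    process(chunk)   # process() stands in for your own per-chunk logic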
12. What is the difference between Join and Merge methods in pandas?
Join: Joins two DataFrames based on their index. However, there is an optional argument ‘on’
to explicitly specify if you want to join based on columns. By default, this function performs
left join.
Syntax: df1.join(df2)
Merge: The merge function is more versatile, allowing you to specify the columns on which
you want to join the DataFrames. It applies inner join by default, but can be customized to use
different join types like left, right, outer, inner, and cross.
Syntax: pd.merge(df1, df2, on="column_names")
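A small sketch (DataFrames invented for the example):
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['John', 'Matt', 'Anna']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'marks': [80, 95, 70]})
# merge: inner join by default, on the common 'id' column
merged = pd.merge(df1, df2, on='id')
print(merged)   # only the rows with id 2 and 3 survive the inner join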
13. What is Timedelta?
Timedelta represents a duration, i.e., the difference between two dates or times, measured in
units such as days, hours, minutes, and seconds.
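For example (dates invented):
import pandas as pd
# Subtracting two timestamps yields a Timedelta
delta = pd.Timestamp('2024-01-10') - pd.Timestamp('2024-01-07')
print(delta)                           # 3 days 00:00:00
# A Timedelta can also be built directly from a string
print(pd.Timedelta('2 days 6 hours'))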
14. What is the difference between append and concat methods?
We can use the concat method to combine DataFrames along either rows or columns. Similarly,
append also combines DataFrames, but only along the rows.
Neither function modifies the original DataFrames in place; both return a new object with the
combined data. Note that DataFrame.append() has been deprecated in recent pandas versions
(and removed in pandas 2.0) in favor of pd.concat().
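A small sketch of concat along the rows (data invented):
import pandas as pd
df1 = pd.DataFrame({'name': ['John'], 'marks': [70]})
df2 = pd.DataFrame({'name': ['Matt'], 'marks': [95]})
# Stack the two DataFrames vertically; ignore_index renumbers the rows
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)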
15. How do you read Excel files to CSV using pandas?
First, we should use the read_excel() function to pull in the Excel data to a variable. Then, just
apply the to_csv() function for a seamless conversion.
Here is the sample code:
import pandas as pd
#input your excel file path into the read_excel() function.
excel_data = pd.read_excel("/content/sample_data/california_housing_test.xlsx")
excel_data.to_csv("CSV_data.csv", index = None, header=True)
16. How do you sort a DataFrame based on columns?
We have the sort_values() method to sort the DataFrame based on a single column or multiple
columns.
Syntax: df.sort_values(by=["column_names"])
Example code:
import pandas as pd
data = [['John', 50, 'Male', 'Austin', 70],
['Cataline', 45 ,'Female', 'San Francisco', 80],
['Matt', 30 ,'Male', 'Boston', 95],
['Oliver',35,'Male', 'New york', 65]]
# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']
# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)
# Sort values based on 'Age' column
df = df.sort_values(by=['Age'])
df.head()
17. Show two different ways to filter data
To create a DataFrame:
import pandas as pd
data = {'Name': ['John', 'Cataline', 'Matt'],
'Age': [50, 45, 30],
'City': ['Austin', 'San Francisco', 'Boston'],
'Marks' : [70, 80, 95]}
# Create a DataFrame df
df = pd.DataFrame(data)
Method 1: Based on conditions
new_df = df[(df.Name == "John") | (df.Marks > 90)]
print (new_df)
Method 2: Using the query function
new_df = df.query('Name == "John" or Marks > 90')
print(new_df)
18. How do you aggregate data and apply some aggregation function like mean or sum on it?
The groupby function lets you aggregate data based on certain columns and perform operations
on the grouped data. In the following code, the data is grouped on the column ‘Name’ and the
mean ‘Marks’ of each group is calculated.
import pandas as pd
# Create a DataFrame
data = {
'Name': ['John', 'Matt', 'John', 'Matt', 'Matt', 'Matt'],
'Marks': [10, 20, 30, 15, 25, 18]
}
# Create a DataFrame df
df = pd.DataFrame(data)
# mean marks of John and Matt
print(df.groupby('Name').mean())
19. How can you create a new column derived from existing columns?
We can use the apply() method to derive a new column by performing operations on existing
columns.
The following code adds a new column named ‘total’ to the DataFrame. This new column holds
the sum of values from the other two columns.
Example code:
import pandas as pd
# Create a DataFrame
data = {
'Name': ['John', 'Matt', 'John', 'Cateline'],
'math_Marks': [18, 20, 19, 15],
'science_Marks': [10, 20, 15, 12]
}
# Create a DataFrame df
df = pd.DataFrame(data)
df['total'] = df.apply(lambda row : row["math_Marks"] + row["science_Marks"], axis=1)
print(df)
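Worth noting: for a simple element-wise operation like this sum, a direct vectorized expression is simpler and faster than apply():
# Equivalent vectorized form of the apply() call above
df['total'] = df['math_Marks'] + df['science_Marks']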
20. How do you handle null or missing values in pandas?
You can use any of the following three methods to handle missing values in pandas:
dropna() – the function removes the missing rows or columns from the DataFrame.
fillna() – fill nulls with a specific value using this function.
interpolate() – this method fills the missing values with computed interpolation values. The
interpolation technique can be linear, polynomial, spline, time, etc.,
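A quick sketch of all three (a tiny DataFrame invented, with one NaN placed deliberately):
import pandas as pd
import numpy as np
df = pd.DataFrame({'marks': [70, np.nan, 95]})
print(df.dropna())        # drops the row containing NaN
print(df.fillna(0))       # replaces NaN with 0
print(df.interpolate())   # fills NaN with 82.5 (linear interpolation)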
21. Difference between fillna() and interpolate() methods
fillna() –
fillna() fills the missing values with the given constant. Plus, you can give forward-filling or
backward-filling inputs to its ‘method’ parameter.
interpolate() –
By default, this function fills the missing or NaN values with the linear interpolated values.
However, you can customize the interpolation technique to polynomial, time, index, spline,
etc., using its ‘method’ parameter.
The interpolation method is highly suitable for time series data, whereas fillna is a more generic
approach.
22. What is Resampling?
Resampling is used to change the frequency at which time series data is reported. Imagine you
have monthly time series data and want to convert it into weekly or yearly data; this is where
resampling is used.
Converting monthly to weekly or daily data is nothing but upsampling. Interpolation techniques
are used to increase the frequencies here.
In contrast, converting monthly to yearly data is termed as downsampling, where data
aggregation techniques are applied.
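A sketch of both directions (a tiny monthly series invented for the example):
import pandas as pd
idx = pd.date_range('2023-01-01', periods=3, freq='MS')
monthly = pd.Series([30, 60, 90], index=idx)
# Downsampling: monthly -> yearly, aggregated with sum
print(monthly.resample('YS').sum())          # 2023: 180
# Upsampling: monthly -> weekly, gaps filled by linear interpolation
print(monthly.resample('W').interpolate())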
23. How do you perform one-hot encoding using pandas?
We perform one-hot encoding to convert categorical values into numeric ones so that they can
be fed to a machine learning algorithm.
import pandas as pd
data = {'Name': ['John', 'Cateline', 'Matt', 'Oliver'],
'ID': [1, 22, 23, 36]}
df = pd.DataFrame(data)
#one hot encoding
new_df = pd.get_dummies(df.Name)
new_df.head()
24. How do you create a line plot in pandas?
To draw a line plot, we have a plot function in pandas.
import pandas as pd
data = {'units': [1, 2, 3, 4, 5],
'price': [7, 12, 8, 13, 16]}
# Create a DataFrame df
df = pd.DataFrame(data)
df.plot(x='units', y='price')
25. What is the pandas method to get the statistical summary of all the columns in a
DataFrame?
df.describe()
This method returns stats like mean, percentile values, min, max, etc., of each column in the
DataFrame.
26. What is Rolling mean?
Rolling mean is also referred to as moving average because the idea here is to compute the
mean of data points for a specified window and slide the window throughout the data. This will
lessen the fluctuations and highlight the long-term trends in time series data.
Syntax: df['column_name'].rolling(window=n).mean()
27. Characterize the DataFrames in pandas?
A DataFrame is a pandas-specific data structure that functions as a two-dimensional table with
axes (rows and columns). A DataFrame is a typical way of storing data that has two
separate indices, namely a row index and a column index. It includes the following
characteristics:
Columns can be heterogeneous, for example of types int and bool.
It is commonly thought of as a dictionary-like container for Series structures, with both rows and
columns. The column labels are denoted as "columns," and the row labels as "index."
Syntax: import pandas as pd
df = pd.DataFrame()
28. Explain how to create a series from dict in Pandas?
A Series is a one-dimensional designated array that can hold any form of data (python
objects, strings, integers, floating-point numbers, etc.). It's important to understand that,
unlike Python lists, a series always contains the same type of data.
Let's look at how to make a pandas Series from a dictionary: pass the dictionary to the
Series() constructor; when the index parameter is omitted, the dictionary keys are taken as the index labels.
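A minimal sketch:
import pandas as pd
data = {'math': 85, 'science': 90, 'english': 78}
ser = pd.Series(data)   # the dict keys become the index labels
print(ser)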
29. Explain about the operations on Series in pandas?
The pandas Series is a one-dimensional labeled array that may hold any type of data
(Python objects, strings, integers, floating-point numbers, etc.). The axis labels are
referred to as the index. A pandas Series is much like a single column in an Excel spreadsheet.
Putting Together a pandas Series-
In practice, a pandas Series is often built by loading a dataset from existing storage, which can
be a SQL database, a CSV file, or an Excel file. A pandas Series can also be made from lists,
dictionaries, and other objects. A series can be created in a number of ways; here is one
example.
Creating a series from an array: To construct a series from an array, we must first import the
NumPy module and then use its array() function.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['M', 'I', 'N', 'D', 'M', 'A', 'J', 'I', 'X'])
ser = pd.Series(data)
print(ser)
Output: a Series whose values are the characters 'M', 'I', 'N', 'D', 'M', 'A', 'J', 'I', 'X' with a
default integer index of 0-8.
30. Explain different ways of creating DataFrames in pandas?
A DataFrame can be created in 3 different ways:
By making use of lists:
d = [['a', 2], ['b', 3], ['c', 4]]
# Creating the pandas DataFrame:
df = pd.DataFrame(d, columns=['Strings', 'Integer'])
print(df)
By making use of a dictionary of lists:
All of the lists in the dictionary must be the same length. If an index is passed, its length must
equal the length of the lists. If no index is passed, the index defaults to range(n), where n is
the list length.
import pandas as pd
d = {'Name': ['XYZ', 'ABC', 'DEFC', 'ASWE'], 'marks': [85, 80, 75, 70]}
df = pd.DataFrame(d)
print(df)
By using NumPy arrays:
A two-dimensional NumPy array can be passed directly to pd.DataFrame(), optionally along
with column labels.
35. Explain about Multi Indexing in pandas?
Multi-indexing (hierarchical indexing) is a fundamental indexing feature because it simplifies
data inspection and manipulation, especially when dealing with higher-dimensional data. It
also allows us to store and handle data with an arbitrary number of dimensions in
lower-dimensional data structures like Series and DataFrames.
36. Explain about Pandas index?
Indexing in Pandas is the process of extracting specific rows and columns of data from a
DataFrame. Indexing could simply be selecting all of the rows and some of the columns, or
part of the rows and all of the columns, along with some of each row and column. Indexing is
often referred to as subset selection.
pandas indexing with [], .loc[], .iloc[], and .ix[]
There are numerous methods for obtaining elements, rows, and columns from a DataFrame.
In pandas, several indexing methods can be used to retrieve an element from a DataFrame.
These indexing methods look extremely similar but behave significantly differently. pandas
supports four methods of multi-axis indexing, which are as follows:
• DataFrame[]: This method is also known as the indexing operator.
• DataFrame.loc[]: This method is used for label-based indexing.
• DataFrame.iloc[]: This method is used for integer- or position-based indexing.
• DataFrame.ix[]: This method was used for both integer- and label-based indexing, but it is
deprecated and has been removed in modern pandas versions.
They are referred to collectively as indexers, and they are by far the most common ways of
indexing data. These methods help in retrieving elements, rows, and columns from a
DataFrame.
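A brief sketch of the first three indexers (the fourth, .ix[], is no longer available):
import pandas as pd
df = pd.DataFrame({'age': [28, 24], 'marks': [70, 80]}, index=['John', 'Anna'])
print(df['age'])               # []   : select a column by name
print(df.loc['John', 'age'])   # .loc : label-based -> 28
print(df.iloc[0, 1])           # .iloc: position-based -> 70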
37. Explain about Reindexing in Pandas?
Reindexing conforms a DataFrame to a new index, with optional filling logic. It inserts
NA/NaN in the positions where elements are missing from the previous index. A new object is
returned unless the new index is equivalent to the current one and copy=False.
Reindexing is used to modify the index of the DataFrame's rows and columns.
38. Can you explain multi-indexing columns in Pandas?
Multi-indexing is a vital indexing feature for data manipulation and analysis, particularly when
operating on higher-dimensional data. It also allows us to store and modify data with an
arbitrary number of dimensions in lower-dimensional data structures like DataFrame and
Series.
Multiple Index Columns
Two columns will be used as index columns in this case. In set_index(), the drop parameter
controls whether the original columns are removed, whereas the append parameter appends the
given columns to an index that already exists.
Example:
# importing the pandas library
import pandas as pd
# Creating data
Information = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
'Jobs': ["Software Developer", "System Engineer",
"Footballer", "Singer"],
'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}
# Framing the whole data into a DataFrame
df = pd.DataFrame(Information)
# Showing the above data
print(df)
Output:
       name                Jobs  Annual Salary(L.P.A)
0    Saikat  Software Developer                  12.4
1  Shrestha     System Engineer                   5.6
2     Sandi          Footballer                   9.3
3   Abinash              Singer                  10.0
-> String Methods:
• isupper(): If all of the characters in the Series/Index string are uppercase, it returns True.
Otherwise, False is returned.
• isnumeric(): If all of the characters in the Series/Index string are numeric, it returns True.
Otherwise, False is returned.
-> Count Values:
The value_counts() method counts the occurrences of each unique value in a Series, which
makes it easy to tally all the possible values of a column.
42. How can a DataFrame be converted to an Excel file?
Using the to_excel() function, we can export the DataFrame to an Excel file. To write a single
object, we only need to mention the target file name. If we wish to write to multiple
sheets, we must build an ExcelWriter object with the target file name and specify the sheet in
the file that we want to write to.
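A sketch of both cases (file and sheet names are placeholders; writing .xlsx files requires a backend such as openpyxl):
import pandas as pd
df = pd.DataFrame({'name': ['John', 'Anna'], 'marks': [70, 80]})
# Single object: just give the target file name
df.to_excel('output.xlsx', index=False)
# Multiple sheets: use an ExcelWriter object
with pd.ExcelWriter('report.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    df.to_excel(writer, sheet_name='Sheet2', index=False)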
43. How to format the data in your pandas DataFrame?
Much of the time, you'll want to be able to execute operations on the actual values in your
DataFrame.
Replacing All String Occurrences in a DataFrame:
The replace() method can be used to easily replace specific strings in your DataFrame.
Simply pass the values you want to change, accompanied by the values you would like
to substitute them with.
It's worth noting that there's a regex argument that can come in handy when dealing with
unusual string combinations. In a nutshell, the replace() method is used when you wish to
substitute values or strings in your DataFrame with others.
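A small sketch (values invented):
import pandas as pd
df = pd.DataFrame({'city': ['NY', 'LA', 'NY']})
# Replace every occurrence of 'NY' with 'New York'
df = df.replace({'city': {'NY': 'New York'}})
# The regex argument handles pattern-based replacements
df = df.replace(to_replace=r'^L.*', value='Los Angeles', regex=True)
print(df)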
Removing Parts From Strings in the Cells of Your DataFrame:
Removing unwanted pieces of strings is a tedious task. Fortunately, there is a remedy! You
apply a lambda function element-by-element to the column using map(). The function in
this example takes the string value and strips the + or - on the left, as well as any of the six
characters aAbBcC on the right.
Splitting Text in a Column into Multiple Rows in a DataFrame:
It's difficult to divide your text into many rows.
Applying A Function to Your Pandas DataFrame's Columns or Rows:
You might want to use a function to alter the information in the DataFrame. The following
code pieces illustrate how to apply a function to a DataFrame.
To apply an aggregation across one or more columns, use the DataFrame.aggregate()
method. Aggregations may be specified as strings, callables, dictionaries, or lists of strings.
The following are the most common aggregations:
• sum: This method returns the sum of the values for the requested column.
• min: This method returns the minimum value for the requested column.
• max: This method returns the maximum value for the requested column.
Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)
func: string, callable, list, or dictionary of callables to use for aggregating the data. If a
function is passed, it must either work when passed a DataFrame or when passed to
DataFrame.apply(). For a DataFrame, a dict mapping column names to functions can also be
given.
axis (default 0): 0 or 'index' applies the function to each column; 1 or 'columns' applies the
function to each row.
Let us see an example of data aggregation:
# importing pandas package
import pandas as pd
# making a DataFrame from a csv file
df = pd.read_csv("nba.csv")
# printing the first 10 rows of the dataframe
df[:10]
# aggregating over numeric columns (column names assumed for illustration)
df[['Age', 'Salary']].aggregate(['sum', 'min', 'max'])
45. What is the use of GroupBy in pandas?
The data is divided into groups using GroupBy. It organizes the data according to given
criteria, mapping labels to group names. Its behaviour can be tuned through many parameters,
which makes splitting data straightforward.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)
46. What is NumPy (on which pandas is built)?
NumPy is an open-source Python package used to work with huge numbers of datasets
efficiently. It includes a robust N-dimensional array object as well as sophisticated
mathematical functions for data processing with Python.
Fourier transforms, linear algebra, and pseudo-random number capabilities are among the
prominent features provided by NumPy. It also includes integration tools for C/C++ and
Fortran code.
47. What is Vectorization in Python pandas?
The procedure of executing operations on an entire array at once is known as vectorization.
This is intended to avoid explicit iteration in the methods. pandas has a range of
vectorized methods, such as string functions and aggregations, that are optimized for use with
Series and DataFrames. As a result, it is preferable to use vectorized pandas methods to
perform tasks quickly.
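A quick sketch contrasting an explicit loop with the vectorized form:
import pandas as pd
s = pd.Series(range(1000000))
# Slow: explicit Python loop
total = 0
for value in s:
    total += value
# Fast: the vectorized aggregation runs in optimized compiled code
total = s.sum()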
48. How will you combine different DataFrames in pandas?
Following are the ways to combine different DataFrames in pandas:
-> append() method: This is used to stack DataFrames vertically, along the rows. (It has been
deprecated in recent pandas versions in favor of concat().)
Syntax: df1.append(df2)
-> concat() method: This is used to stack DataFrames along rows or columns. This works best
when the DataFrames have the same fields and columns.
Syntax: pd.concat([df1, df2])
-> join() method: This is used to combine data from different DataFrames that share an index
or one or more common columns.
Syntax: df1.join(df2)
49. How can you iterate over the DataFrame in pandas?
To iterate over a DataFrame in pandas, a for loop can be combined with an iterrows() call.
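A minimal sketch:
import pandas as pd
df = pd.DataFrame({'name': ['John', 'Anna'], 'marks': [70, 80]})
# iterrows() yields (index, row) pairs; each row is a Series
for idx, row in df.iterrows():
    print(idx, row['name'], row['marks'])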
50. What Method Will You Use To Rename The Index Or Columns Of Pandas Dataframe?
The .rename method is used to rename DataFrame index values or columns.
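For example (names invented):
import pandas as pd
df = pd.DataFrame({'name': ['John'], 'marks': [70]})
# Rename a column and an index value
df = df.rename(columns={'marks': 'score'}, index={0: 'row1'})
print(df)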
PART B (5 MARKS)
1. How do you load a CSV file named “data.csv” into a pandas DataFrame named “df”?
To load a CSV file into a pandas DataFrame, you use the pd.read_csv() function. This
function reads a CSV file and returns a DataFrame, which is a 2-dimensional labeled data
structure (like a table). Here’s a breakdown of how it works:
Syntax:
python
df = pd.read_csv('path_to_file.csv')
'path_to_file.csv' can be a local file path or a URL if the file is hosted online.
Explanation:
The read_csv() function reads the data from the file and automatically infers the columns and
the data types.
It loads the CSV data into a pandas DataFrame, where each row in the CSV file becomes a
row in the DataFrame, and each column in the CSV becomes a column in the DataFrame.
Additional Parameters:
delimiter: You can specify the delimiter if it's not a comma (e.g., \t for tab-delimited files).
header: If your CSV file doesn’t have a header row, you can specify the header row with
header=None.
names: If you want to specify column names manually, use the names parameter.
Example:
python
import pandas as pd
# Load the CSV file into a DataFrame named df
df = pd.read_csv('data.csv')
# Display the first 5 rows for a quick overview
print(df.head())
The df.head() function displays the first 5 rows of the DataFrame, giving us a quick overview
of the data.
If the file path is incorrect or the file is not found, Python will raise a FileNotFoundError. In
such cases, make sure to verify the file's location or use the absolute path.
2. How can you display the last 10 rows of a DataFrame named “df”?
In pandas, to display the last few rows of a DataFrame, you can use the .tail() method. By
default, .tail() will return the last 5 rows, but you can specify any number of rows you wish to
display by passing an integer as an argument.
Syntax:
python
df.tail(n)
n is the number of rows you want to display from the end of the DataFrame. If no value is
provided, it defaults to 5.
Explanation:
.tail(n) retrieves the last n rows of the DataFrame, which is particularly useful when you want
to inspect the data from the bottom of a large dataset.
This method doesn't alter the DataFrame itself but simply returns a view of the last n rows.
Example:
python
import pandas as pd
# Example DataFrame
data = {'name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Sarah'],
'age': [28, 24, 35, 32, 45, 29],
'city': ['New York', 'Los Angeles', 'Chicago', 'Miami', 'Boston', 'San Francisco']}
df = pd.DataFrame(data)
# Display the last 10 rows (here the whole DataFrame, since it has only 6 rows)
print(df.tail(10))
Important Points:
If you specify a value of n greater than the number of rows in the DataFrame, the method will
return the entire DataFrame.
.tail() is often used to check the last few records in time-series data or any dataset that is
ordered chronologically.
Use Case:
Imagine you have a DataFrame with sales data for multiple years, and you want to inspect the
most recent entries. Using .tail() helps you easily focus on the last portion of the data.
3. What is the difference between loc[] and iloc[] in pandas?
Both loc[] and iloc[] are used to access data in a pandas DataFrame, but they are used in
different contexts based on the type of indexing you need.
loc[] – Label-based indexing
loc[] is used when you want to access data based on labels (index labels or column names).
The loc[] function allows you to specify the rows and columns by their labels, which are the
names of the rows or columns in the DataFrame.
Inclusive of the start and end points: When you slice with loc[], both the start and end
labels are included.
Syntax:
python
df.loc[row_label, column_label]
iloc[] – Integer position-based indexing
iloc[] is used when you want to access data based on integer positions (index positions or
column positions).
The iloc[] function allows you to specify the rows and columns by their integer positions.
Exclusive of the end position: When you slice with iloc[], the end index is not included,
similar to Python's standard slicing behavior.
Syntax:
python
df.iloc[row_position, column_position]
Key Differences:
1. Type of Indexing:
o loc[] uses labels (row/column names).
o iloc[] uses integer positions (0-based indexing).
2. Slicing:
o With loc[], when slicing, both the start and the end labels are included.
o With iloc[], when slicing, the end index is excluded.
Examples:
Example using loc[]:
python
import pandas as pd
# Sample DataFrame
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
'age': [28, 24, 35, 32],
'city': ['New York', 'Los Angeles', 'Chicago', 'Miami']}
df = pd.DataFrame(data)
df.set_index('name', inplace=True)
print(df.loc['Anna', 'age'])           # access by row label and column name
print(df.loc['John':'Linda', 'city'])  # label slice, inclusive of both ends
In this example:
We access the data using the label Anna and the column name age.
The slice df.loc['John':'Linda', 'city'] will return all rows between 'John' and 'Linda' (inclusive)
and the 'city' column.
Example using iloc[]:
python
df = pd.DataFrame(data)   # re-create df without 'name' as the index
print(df.iloc[1, 1])      # second row, second column -> 24
print(df.iloc[0:3, 2])    # rows at positions 0-2, third column ('city')
In this example:
We access the data by position: df.iloc[1, 1] accesses the second row and second column
(remember, 0-based indexing).
The slice df.iloc[0:3, 2] returns rows from position 0 to position 2 (excluding position 3) and
the third column (index 2, which is 'city').
Important Notes:
When using loc[], the row/column names must exist in the DataFrame. If you try to access a
label that doesn't exist, pandas will raise a KeyError.
When using iloc[], you need to ensure that the integer positions you provide are within the
bounds of the DataFrame. If you try to access an index position that doesn’t exist, pandas will
raise an IndexError.
4. Write code to filter rows in a DataFrame where the value in the “age” column is greater
than 30.
In pandas, you can filter rows using Boolean indexing. To filter rows where the value in a
specific column meets a condition, you can use a condition inside the square brackets.
Syntax:
python
df[df['column_name'] > value]   # or any other Boolean condition
Explanation:
df['age'] > 30 creates a Boolean series (True/False values) where each value is True if the
condition is met (i.e., age is greater than 30) and False otherwise.
This Boolean series is then used to filter the rows.
Example:
python
import pandas as pd
# Sample DataFrame
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
'age': [28, 24, 35, 32],
'city': ['New York', 'Los Angeles', 'Chicago', 'Miami']}
df = pd.DataFrame(data)
# Keep only the rows where age is greater than 30
filtered_df = df[df['age'] > 30]
print(filtered_df)   # Peter (35) and Linda (32)
5. How do you drop rows with missing values from a DataFrame?
To drop rows with missing values (i.e., NaN), you can use the .dropna() method in pandas.
Syntax:
python
df.dropna()
Explanation:
.dropna() will return a new DataFrame with all rows containing any missing values removed.
By default, it drops rows where any column contains missing values.
It does not modify the original DataFrame unless you set inplace=True.
Example:
python
import pandas as pd
import numpy as np
# Sample DataFrame with a missing value
df = pd.DataFrame({'name': ['John', 'Anna', 'Peter'],
'age': [28, np.nan, 35]})
# Drop the row containing NaN
df_cleaned = df.dropna()
print(df_cleaned)
6. How do you calculate the sum of values in a column?
To add up the values of a numerical column, use the .sum() method.
Syntax:
python
df['column_name'].sum()
Explanation:
.sum() adds up the values in the specified column.
The column must contain numerical values (integers, floats) to sum them.
Example:
python
import pandas as pd
# Sample DataFrame
data = {'product': ['A', 'B', 'C', 'D'],
'sales': [150, 200, 175, 225]}
df = pd.DataFrame(data)
print(df['sales'].sum())   # 750
In this example:
The .sum() method adds up all values in the 'sales' column and returns the result, which is
750.
7. Explain how the groupby() function works in pandas and provide an example.
The groupby() function in pandas is used to group data based on one or more columns and then
apply a function (e.g., sum, mean, count) to each group. It is similar to the "GROUP BY" operation in
SQL.
Syntax:
python
df.groupby('column_name').agg_function()
Explanation:
The groupby() function splits the DataFrame into groups based on the unique values of the
specified column(s).
After grouping, you can apply an aggregate function like sum(), mean(), or count() to each
group.
The result is a new DataFrame with the aggregated values.
Example:
python
import pandas as pd
# Sample DataFrame
data = {'category': ['A', 'B', 'A', 'B', 'A'],
'sales': [150, 200, 175, 225, 100]}
df = pd.DataFrame(data)
# Total sales per category
print(df.groupby('category')['sales'].sum())
Output:
category
A 425
B 425
Name: sales, dtype: int64
In this example:
The DataFrame is grouped by the 'category' column.
The sum() function is applied to the 'sales' column within each group, resulting in the total
sales for each category.
9. How do you reshape a DataFrame using the pivot() function?
The pivot() function turns the unique values of one column into new columns.
Example:
python
import pandas as pd
# Sample DataFrame
data = {'date': ['2021-01-01', '2021-01-02', '2021-01-03'],
'product': ['A', 'B', 'A'],
'sales': [100, 150, 200]}
df = pd.DataFrame(data)
print(df.pivot(index='date', columns='product', values='sales'))
Output:
product A B
date
2021-01-01 100.0 NaN
2021-01-02 NaN 150.0
2021-01-03 200.0 NaN
10. Write code to sort a DataFrame by the “price” column in descending order.
To sort a DataFrame by a specific column, you can use the .sort_values() method.
Syntax:
python
df.sort_values(by='column_name', ascending=False)
Explanation:
The by parameter specifies the column(s) by which to sort.
The ascending=False parameter sorts the values in descending order.
Example:
python
import pandas as pd
# Sample DataFrame
data = {'product': ['A', 'B', 'C', 'D'],
'price': [150, 200, 175, 225]}
df = pd.DataFrame(data)
# Sort by price, highest first
print(df.sort_values(by='price', ascending=False))
product price
3 D 225
1 B 200
2 C 175
0 A 150
PART C (10 MARKS)
1. Describe different methods for loading data into a pandas DataFrame, and discuss the
advantages and disadvantages of each method.
1. read_csv():
o Description: This is the most common method for loading data stored in a CSV file
into a DataFrame. It reads data line by line and converts it into a tabular format.
o Advantages:
▪ Highly flexible, allows users to specify types, skip rows, or parse dates.
o Disadvantages:
▪ Parsing very large CSV files can take significant time and memory.
Example:
import pandas as pd
df = pd.read_csv("data.csv", delimiter=',')
2. read_excel():
o Description: This method is used to read Excel files (.xls, .xlsx) into pandas
DataFrame.
o Advantages:
▪ Useful when dealing with data in Excel files with complex formatting.
o Disadvantages:
▪ Requires the installation of libraries like openpyxl or xlrd to read .xlsx or .xls
files.
Example:
df = pd.read_excel("data.xlsx", sheet_name='Sheet1')
3. read_sql():
o Description: Loads data directly from a SQL database (e.g., MySQL, PostgreSQL)
into a pandas DataFrame by executing a SQL query.
o Advantages:
▪ Direct integration with databases, making it easier to work with large datasets
that reside in databases.
o Disadvantages:
▪ Performance may be slower than loading from a flat file, especially if the
query is complex.
Example:
import sqlite3
connection = sqlite3.connect("database.db")
# The table name below is a placeholder for illustration
df = pd.read_sql("SELECT * FROM table_name", connection)
4. read_json():
o Description: Reads data from a JSON file into a pandas DataFrame. JSON is a
commonly used format for web APIs and nested data structures.
o Advantages:
▪ Handles the nested, semi-structured data commonly returned by web APIs.
o Disadvantages:
▪ Not ideal for reading very large datasets or datasets with deeply nested
structures without preprocessing.
Example:
df = pd.read_json("data.json")
Conclusion:
• The choice of method depends on the format of the data you're working with (CSV, Excel,
JSON, SQL) and the size of the dataset. read_csv() is most commonly used, but other
methods like read_sql() and read_excel() provide better options for specific use cases like
databases and Excel files.
2. Explain the importance of indexing and slicing in pandas, and provide examples of both using
loc[] and iloc[].
• Indexing in pandas refers to selecting data based on its row or column labels, while slicing
refers to selecting subsets of rows or columns based on a given range or condition.
• Importance:
o Efficiency: Proper use of indexing and slicing enables efficient data manipulation,
especially when working with large datasets.
• loc[] allows for label-based selection. It can be used for row and column selection, both
using labels (names of rows and columns).
• Inclusive: When slicing with loc[], both the starting and ending points are included.
Example:
import pandas as pd
df = pd.DataFrame(data)
print(df.loc[1:3, ['name', 'age']]) # Rows from index 1 to 3 and columns 'name' and 'age'
Advantages of loc[]:
• iloc[] uses integer positions to select data. It is 0-indexed, meaning the first row has an index
of 0.
Example:
print(df.iloc[0:2, 0:2]) # First two rows and first two columns (end positions excluded)
Advantages of iloc[]:
• Ideal when you need to access rows or columns by their position rather than their label.
Conclusion:
• loc[] is useful when you need to access data using row/column labels.
• iloc[] is useful when working with integer positions (especially useful for iterating over data
or when row/column labels are unknown or irrelevant).
3. Discuss common techniques for cleaning data in pandas, including handling missing values,
removing duplicates, and dealing with outliers. Provide code examples for each.
Data cleaning is one of the most critical steps in any data analysis pipeline, as it ensures the accuracy
and consistency of the dataset.
1. Handling Missing Values:
• Techniques:
o dropna(): Removes rows (or columns) that contain missing values.
o fillna(): Fills missing values with a specified value or method (e.g., forward fill or
backward fill).
Example:
df_cleaned = df.dropna()
df_forward_filled = df.fillna(method='ffill')
2. Removing Duplicates:
• drop_duplicates(): Removes duplicate rows; the subset parameter restricts the duplicate
check to specific columns.
Example:
df_no_duplicates = df.drop_duplicates()
df_no_duplicates_columns = df.drop_duplicates(subset=['name'])
3. Dealing with Outliers:
• Techniques:
o Z-score method: Identifies outliers by calculating the standard score (Z-score) for
each data point.
o IQR method: Uses the Interquartile Range (IQR) to find outliers by identifying
points that are outside the range defined by Q1 - 1.5*IQR and Q3 + 1.5*IQR.
Example:
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df['age']))
df_no_outliers = df[z_scores < 3] # keep rows within 3 standard deviations
Conclusion:
• Data cleaning techniques, including handling missing values, removing duplicates, and
dealing with outliers, are essential steps in preparing data for analysis. Using methods like
fillna(), drop_duplicates(), and outlier detection methods helps ensure that the data is reliable
and free of errors.
4. Discuss data transformation techniques in pandas, such as merging, joining, and reshaping
DataFrames. Provide examples of each.
Data transformation techniques in pandas allow you to manipulate and structure data in various forms
for analysis.
1. Merging DataFrames:
• merge() is used to combine DataFrames based on one or more columns. It works similarly to
SQL joins (e.g., inner, outer, left, right).
Example:
python
# 'key' is a placeholder for the shared column name
df_merged = pd.merge(df1, df2, on='key', how='inner')
2. Joining DataFrames:
• join() is used to join DataFrames on their index (by default) or on a specified column.
Example:
df_joined = df1.join(df2)
3. Reshaping DataFrames:
• pivot() reshapes data by turning unique column values into separate columns, while melt()
does the reverse, unpivoting columns into rows.
Example of melt():
python
# 'name' is kept as the identifier column; the remaining columns are unpivoted (names assumed)
df_long = pd.melt(df, id_vars=['name'], var_name='subject', value_name='mark')
5. Explain how the groupby() function works in pandas and how it can be used for data
aggregation. Provide examples of grouping data and applying aggregate functions.
groupby() in pandas:
• groupby() is used to split the data into groups based on some criteria and apply functions to
each group independently.
• It can be used to aggregate data (e.g., sum, mean), transform, or filter groups.
Steps:
1. Split: the data is divided into groups based on the specified key(s).
2. Apply: a function (aggregation, transformation, or filter) is applied to each group independently.
3. Combine: the results are combined into a new DataFrame or Series.
Aggregation Functions:
• Common aggregation functions include sum(), mean(), count(), min(), max(), etc.
Example:
python
import pandas as pd
# Sample values chosen so the grouped sums match the output shown below
data = {'category': ['A', 'B', 'A', 'B'],
'sales': [300, 200, 300, 200],
'profit': [60, 40, 60, 40]}
df = pd.DataFrame(data)
grouped = df.groupby('category')[['sales', 'profit']].sum()
print(grouped)
Output:
sales profit
category
A 600 120
B 400 80
• Here, the data is grouped by the category column, and the sum of sales and profit is
calculated for each group.
Conclusion:
• The groupby() function is essential for splitting data into groups, performing operations on
each group, and then combining the results. It is especially useful in aggregating large
datasets.
6. Discuss pandas’ capabilities for time series analysis, including resampling, rolling windows,
and date/time indexing. Provide examples of each.
Pandas provides robust support for time series analysis. It allows handling time-related data and
performing various operations like resampling, rolling windows, and date/time indexing.
1. Date/Time Indexing:
• In pandas, you can set a date column as an index to perform time-based indexing and
selection.
Example:
python
import pandas as pd
# Sample daily data (values invented)
df = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=6, freq='D'),
'data': [1, 2, 3, 4, 5, 6]})
df.set_index('date', inplace=True)
print(df)
Output:
            data
date
2023-01-01     1
2023-01-02     2
2023-01-03     3
2023-01-04     4
2023-01-05     5
2023-01-06     6
2. Resampling:
• Resampling is used to change the frequency of time series data (e.g., converting daily data to
monthly data).
Example:
python
# Resampling the data to a weekly frequency (sum the data within each week)
df_resampled = df.resample('W').sum()
print(df_resampled)
Output:
            data
date
2023-01-01     1
2023-01-08    20
(2023-01-01 falls on a Sunday, so the first weekly bin contains only January 1st; the second bin
sums January 2nd-6th: 2+3+4+5+6 = 20.)
• resample() allows changing the frequency and applying aggregation functions like sum,
mean, etc.
3. Rolling Windows:
• Rolling windows provide a way to apply functions over a sliding window of data.
Example:
python
df['rolling_mean'] = df['data'].rolling(window=3).mean()
print(df)
Output:
data rolling_mean
date
2023-01-01 1 NaN
2023-01-02 2 NaN
2023-01-03 3 2.000000
2023-01-04 4 3.000000
2023-01-05 5 4.000000
2023-01-06 6 5.000000
• rolling() can be used for moving averages, sums, and other rolling operations.
Conclusion:
• Pandas provides powerful tools for working with time series data, including date indexing,
resampling, and rolling windows, making it easier to analyze temporal patterns and trends.
7. Describe how pandas integrates with popular visualization libraries like Matplotlib and
Seaborn for data visualization. Provide examples of creating various types of plots using
pandas.
Pandas integrates seamlessly with Matplotlib and Seaborn, two of the most popular data
visualization libraries, enabling you to easily visualize data from DataFrames.
1. Plotting with pandas (Matplotlib backend):
• You can use the plot() function in pandas, which is built on top of Matplotlib, to create
various plots (e.g., line plots, bar plots).
Example (a minimal sketch with invented data):
python
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 15, 25]})
df.plot(x='x', y='y', kind='line')
plt.show()
2. Plotting with Seaborn:
• Seaborn provides high-level interfaces for drawing attractive and informative statistical
graphics.
• It works directly with pandas DataFrames and simplifies visualizations like box plots,
histograms, and heatmaps.
Example (a sketch; the DataFrame and its columns are assumed):
python
import seaborn as sns
sns.boxplot(x='category', y='value', data=df)
plt.show()
Example of a bar plot:
python
df = pd.DataFrame({'category': ['A', 'B', 'C', 'D'], 'value': [23, 45, 56, 78]})
df.plot(kind='bar', x='category', y='value')
plt.show()
Conclusion:
• Pandas provides built-in support for Matplotlib and Seaborn, making it easy to generate a
variety of plots directly from DataFrames. You can create line plots, bar charts, box plots, and
more with minimal code, aiding in effective data visualization.
8. Discuss techniques for optimizing the performance of pandas operations, such as using
vectorized operations, avoiding loops, and utilizing specialized functions. Provide examples to
illustrate each technique.
1. Vectorized Operations:
• Vectorization allows for applying operations on entire arrays (or Series) at once, rather than
using loops, which significantly improves performance.
Example:
df['data'] = df['data'] + 1
2. Avoiding Loops:
• For loops are slow in pandas. It is better to use pandas’ built-in functions (which are
optimized for speed) to avoid explicit loops.
Example:
python
# apply() avoids an explicit Python loop; the fully vectorized form
# df['new_column'] = df['data'] * 2 is faster still
df['new_column'] = df['data'].apply(lambda x: x * 2)
3. Specialized Functions:
• pandas provides optimized functions for specific tasks, such as cumsum(), sum(), and
mean(), which are faster than writing custom loops.
Example:
python
df['cumsum'] = df['data'].cumsum()
Conclusion:
• Preferring vectorized operations and pandas' built-in, optimized functions over explicit
Python loops is the key to keeping pandas code fast on large datasets.
9. Explain how pandas handles categorical data and discuss the benefits of encoding categorical
variables. Provide examples of encoding categorical variables using pandas.
• Pandas has a Categorical data type, which allows you to represent data with a fixed number
of possible values, thus saving memory and improving performance when working with
repetitive text or categorical data.
• Faster Operations: Operations on categorical data are faster than on regular object data
types.
Example:
python
df['category'] = df['category'].astype('category')
# One-hot encoding
df_encoded = pd.get_dummies(df['category'])
Conclusion:
By using categorical data types and techniques like one-hot encoding, pandas helps to
efficiently manage categorical variables, reducing memory usage and speeding up processing.
10. Discuss common errors and exceptions encountered when working with pandas, and discuss
strategies for handling these errors gracefully. Provide examples of error handling techniques in
pandas.
1. KeyError: Occurs when trying to access a column or row that does not exist.
2. ValueError: Happens when the data shape or type is incompatible with an operation.
1. Try-Except Blocks: Catch and handle exceptions using try-except to prevent crashes.
Example:
python
try:
    df['non_existent_column']
except KeyError:
    print('Column not found')
2. Using .get(): Returns a default value instead of raising a KeyError when the column is
missing.
Example:
python
df.get('non_existent_column', 'default_value')
3. Checking for Missing Values with .isna():
Example:
python
df['age'].isna().sum() # count of missing values (column name assumed)
Conclusion:
• Error handling is essential in data analysis workflows. By using try-except blocks, pandas-
specific methods like .get(), and handling missing values with .isna(), you can ensure that
your code is robust and can handle common issues without crashing.