
U23ADTC01 – Programming in Python 2024-2025

Lecture Notes
UNIT IV - DATA MANIPULATION WITH PANDAS

Prepared By:
Dr. M. Shanmugam
Mr. D. Rajesh
Mrs. M. Hemalatha
Mrs. S. Deeba
Ms. Amala Margret

PART- A (2 Marks)
1. What is pandas in Python? (K1)
Pandas is an open-source Python library with powerful and built-in methods to efficiently clean,
analyze, and manipulate datasets. Developed by Wes McKinney in 2008, this powerful package
can easily blend with various other data science modules in Python.
Pandas is built on top of the NumPy library; its core data structures, Series and DataFrame, are built on NumPy arrays.
2. How do you access the top 6 rows and last 7 rows of a pandas DataFrame?
The head() method in pandas is used to access the initial rows of a DataFrame, and tail() method
is used to access the last rows.
To access the top 6 rows: dataframe_name.head(6)
To access the last 7 rows: dataframe_name.tail(7)
3. Why doesn’t DataFrame.shape have parenthesis?
In pandas, shape is an attribute and not a method. So, you should access it without parentheses.
DataFrame.shape outputs a tuple with the number of rows and columns in a DataFrame.
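Example (a minimal sketch; the small DataFrame is illustrative):
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.shape)   # (3, 2) -> 3 rows, 2 columns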
4. What is the difference between Series and DataFrame?
DataFrame: The pandas DataFrame will be in tabular format with multiple rows and columns
where each column can be of different data types.
Series: The Series is a one-dimensional labeled array that can store any data type, but all of its
values should be of the same data type. The Series data structure is more like a single column of
a DataFrame.
The Series data structure consumes less memory than a DataFrame. So, certain data
manipulation tasks are faster on it. However, DataFrame can store large and complex datasets,
while Series can handle only homogeneous data. So, the set of operations you can perform on
DataFrame is significantly higher than on Series data structure.
5. What is an index in pandas?
The index is a series of labels that can uniquely identify each row of a DataFrame. The index
can be of any datatype like integer, string, hash, etc.,
df.index prints the current row indexes of the DataFrame df.
6. What is Multi indexing in pandas?
Index in pandas uniquely specifies each row of a DataFrame. We usually pick the column that
can uniquely identify each row of a DataFrame and set it as the index. But what if you don’t
have a single column that can do this?


For example, you have the columns “name”, “age”, “address”, and “marks” in a DataFrame.
Any of the above columns may not have unique values for all the different rows and are unfit
as indexes.
However, the columns “name” and “address” together may uniquely identify each row of the
DataFrame. So you can set both columns as the index. Your DataFrame now has a multi-index
or hierarchical index.
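Example (a short sketch using hypothetical columns like those above):
import pandas as pd
df = pd.DataFrame({'name': ['John', 'Anna', 'John'],
                   'address': ['Austin', 'Boston', 'Boston'],
                   'marks': [70, 80, 95]})
df = df.set_index(['name', 'address'])   # hierarchical (multi) index
print(df.index)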
7. Explain pandas Reindexing
Reindexing in pandas lets us create a new DataFrame object from the existing DataFrame with
the updated row indexes and column labels.
You can provide a set of new indexes to the function DataFrame.reindex() and it will create a
new DataFrame object with given indexes and take values from the actual DataFrame.
If values for these new indexes were not present in the original DataFrame, the function fills
those positions with NaN by default. However, we can change the default fill from NaN to any
value we want via the fill_value parameter.
Here is the sample code:
Create a DataFrame df with indexes:
import pandas as pd
data = [['John', 50, 'Austin', 70],
['Cataline', 45 , 'San Francisco', 80],
['Matt', 30, 'Boston' , 95]]
columns = ['Name', 'Age', 'City', 'Marks']
#row indexes
idx = ['x', 'y', 'z']
df = pd.DataFrame(data, columns=columns, index=idx)
print(df)
Reindex with new set of indexes:
new_idx = ['a', 'y', 'z']
new_df = df.reindex(new_idx)
print(new_df)
The new_df has values from the df for common indexes ( ‘y’ and ‘z’), and the new index ‘a’ is
filled with the default NaN.
8. What is the difference between loc and iloc?
Both loc and the iloc methods in pandas are used to select subsets of a DataFrame. Practically,
these are widely used for filtering DataFrame based on conditions.
We should use the loc method to select data using actual labels of rows and columns, while the
iloc method is used to extract data based on integer indices of rows and columns.
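A two-line contrast (a sketch; df is assumed to carry the string row labels 'x', 'y', 'z' used in the reindexing example above):
df.loc['y']    # row whose index label is 'y'
df.iloc[1]     # second row, by integer position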
9. Show two different ways to create a pandas DataFrame
Using Python Dictionary:
import pandas as pd
data = {'Name': ['John', 'Cataline', 'Matt'],
'Age': [50, 45, 30],
'City': ['Austin', 'San Francisco', 'Boston'],
'Marks' : [70, 80, 95]}
df = pd.DataFrame(data)
Using Python Lists:
import pandas as pd
data = [['John', 25, 'Austin',70],
['Cataline', 30, 'San Francisco',80],
['Matt', 35, 'Boston',90]]
columns = ['Name', 'Age', 'City', 'Marks']
df = pd.DataFrame(data, columns=columns)
10. How do you get the count of all unique values of a categorical column in a DataFrame?
The function Series.value_counts() returns the count of each unique value of a series or a
column.
Example:


We have created a DataFrame df that contains a categorical column named ‘Sex’, and
ran value_counts() function to see the count of each unique value in that column.
import pandas as pd
data = [['John', 50, 'Male', 'Austin', 70],
['Cataline', 45 ,'Female', 'San Francisco', 80],
['Matt', 30 ,'Male','Boston', 95]]
# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']
# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)
df['Sex'].value_counts()
11. How Do you optimize the performance while working with large datasets in pandas?
Load less data: While reading data using pd.read_csv(), choose only the columns you need
with the “usecols” parameter to avoid loading unnecessary data. Plus, specifying the
“chunksize” parameter splits the data into different chunks and processes them sequentially.
Avoid loops: Loops and iterations are expensive, especially when working with large datasets.
Instead, opt for vectorized operations, as they are applied on an entire column at once, making
them faster than row-wise iterations.
Use data aggregation: Try aggregating data and perform statistical operations because
operations on aggregated data are more efficient than on the entire dataset.
Use the right data types: The default data types in pandas are not memory efficient. For
example, integer values take the default datatype of int64, but if your values can fit in int32,
adjusting the datatype to int32 can optimize the memory usage.
Parallel processing: Dask provides a pandas-like API for working with large datasets. It utilizes
multiple processes of your system to execute different data tasks in parallel.
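A sketch of two of these ideas (the file name and column names are hypothetical):
import pandas as pd
# Load only the needed columns, in chunks of 100,000 rows
chunks = pd.read_csv("big_data.csv", usecols=["id", "price"], chunksize=100_000)
total = 0
for chunk in chunks:
    chunk["price"] = chunk["price"].astype("float32")  # downcast to save memory
    total += chunk["price"].sum()                      # vectorized, no row loop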
12. What is the difference between Join and Merge methods in pandas?
Join: Joins two DataFrames based on their index. However, there is an optional argument ‘on’
to explicitly specify if you want to join based on columns. By default, this function performs
left join.
Syntax: df1.join(df2)
Merge: The merge function is more versatile, allowing you to specify the columns on which
you want to join the DataFrames. It applies inner join by default, but can be customized to use
different join types like left, right, outer, inner, and cross.
Syntax: pd.merge(df1, df2, on="column_names")
13. What is Timedelta?
Timedelta represents a duration, i.e., the difference between two dates or times, measured in
units such as days, hours, minutes, and seconds.
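Example (the dates are illustrative):
import pandas as pd
delta = pd.Timestamp("2024-03-05") - pd.Timestamp("2024-03-01")
print(delta)                   # 4 days 00:00:00
print(pd.Timedelta(hours=36))  # 1 days 12:00:00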
14. What is the difference between append and concat methods?
We can use the concat method to combine DataFrames either along rows or columns. Similarly,
append was also used to combine DataFrames, but only along the rows; it was deprecated and
removed in pandas 2.0 in favor of concat.
Neither function modifies the original DataFrames in place; both return a new DataFrame with
the combined data.
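Example (a minimal sketch):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})
rows = pd.concat([df1, df2], axis=0)  # stack along rows
cols = pd.concat([df1, df2], axis=1)  # place side by side along columns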
15. How do you read Excel files to CSV using pandas?
First, we should use the read_excel() function to pull in the Excel data to a variable. Then, just
apply the to_csv() function for a seamless conversion.
Here is the sample code:
import pandas as pd
#input your excel file path into the read_excel() function.
excel_data = pd.read_excel("/content/sample_data/california_housing_test.xlsx")
excel_data.to_csv("CSV_data.csv", index = None, header=True)
16. How do you sort a DataFrame based on columns?
We have the sort_values() method to sort the DataFrame based on a single column or multiple
columns.
Syntax: df.sort_values(by=[“column_names”])


Example code:
import pandas as pd
data = [['John', 50, 'Male', 'Austin', 70],
['Cataline', 45 ,'Female', 'San Francisco', 80],
['Matt', 30 ,'Male', 'Boston', 95],
['Oliver',35,'Male', 'New york', 65]]
# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']
# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)
# Sort values based on 'Age' column
df = df.sort_values(by=['Age'])
df.head()
17. Show two different ways to filter data
To create a DataFrame:
import pandas as pd
data = {'Name': ['John', 'Cataline', 'Matt'],
'Age': [50, 45, 30],
'City': ['Austin', 'San Francisco', 'Boston'],
'Marks' : [70, 80, 95]}
# Create a DataFrame df
df = pd.DataFrame(data)
Method 1: Based on conditions
new_df = df[(df.Name == "John") | (df.Marks > 90)]
print (new_df)
Method 2: Using query function
new_df = df.query('Name == "John" or Marks > 90')
print(new_df)
18. How do you aggregate data and apply some aggregation function like mean or sum on it?
The groupby function lets you aggregate data based on certain columns and perform operations
on the grouped data. In the following code, the data is grouped on the column ‘Name’ and the
mean ‘Marks’ of each group is calculated.
import pandas as pd
# Create a DataFrame
data = {
'Name': ['John', 'Matt', 'John', 'Matt', 'Matt', 'Matt'],
'Marks': [10, 20, 30, 15, 25, 18]
}
# Create a DataFrame df
df = pd.DataFrame(data)
# mean marks of John and Matt
print(df.groupby('Name').mean())
19. How can you create a new column derived from existing columns?
We can use apply() method to derive a new column by performing some operations on existing
columns.
The following code adds a new column named ‘total’ to the DataFrame. This new column holds
the sum of values from the other two columns.
Example code:
import pandas as pd
# Create a DataFrame
data = {
'Name': ['John', 'Matt', 'John', 'Cateline'],
'math_Marks': [18, 20, 19, 15],
'science_Marks': [10, 20, 15, 12]
}


# Create a DataFrame df
df = pd.DataFrame(data)
df['total'] = df.apply(lambda row : row["math_Marks"] + row["science_Marks"], axis=1)
print(df)
20. How do you handle null or missing values in pandas?
You can use any of the following three methods to handle missing values in pandas:
dropna() – the function removes the missing rows or columns from the DataFrame.
fillna() – fill nulls with a specific value using this function.
interpolate() – this method fills the missing values with computed interpolation values. The
interpolation technique can be linear, polynomial, spline, time, etc.,
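Example (a quick sketch of all three on a small Series; the values are illustrative):
import pandas as pd
s = pd.Series([1.0, None, 3.0, None, 5.0])
print(s.dropna())        # drops the missing entries
print(s.fillna(0))       # replaces each NaN with 0
print(s.interpolate())   # linear fill: 1.0, 2.0, 3.0, 4.0, 5.0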
21. Difference between fillna() and interpolate() methods
fillna() –
fillna() fills the missing values with the given constant. Plus, you can forward-fill or
backward-fill by passing 'ffill' or 'bfill' to its 'method' parameter (newer pandas versions
prefer the dedicated ffill() and bfill() methods).
interpolate() –
By default, this function fills the missing or NaN values with the linear interpolated values.
However, you can customize the interpolation technique to polynomial, time, index, spline,
etc., using its ‘method’ parameter.
The interpolation method is highly suitable for time series data, whereas fillna is a more generic
approach.
22. What is Resampling?
Resampling is used to change the frequency at which time series data is reported. Imagine you
have monthly time series data and want to convert it into weekly data or yearly, this is where
resampling is used.
Converting monthly to weekly or daily data is nothing but upsampling. Interpolation techniques
are used to increase the frequencies here.
In contrast, converting monthly to yearly data is termed as downsampling, where data
aggregation techniques are applied.
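Example (a small sketch with synthetic monthly data):
import pandas as pd
idx = pd.date_range("2024-01-01", periods=6, freq="MS")   # month starts
s = pd.Series([10, 12, 11, 13, 14, 15], index=idx)
yearly = s.resample("YS").sum()    # downsample: monthly -> yearly (aggregate)
weekly = s.resample("W").ffill()   # upsample: monthly -> weekly (fill forward)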
23. How do you perform one-hot encoding using pandas?
We perform one-hot encoding to convert categorical values to numeric ones so that they can be
fed to a machine learning algorithm.
import pandas as pd
data = {'Name': ['John', 'Cateline', 'Matt', 'Oliver'],
'ID': [1, 22, 23, 36]}
df = pd.DataFrame(data)
#one hot encoding
new_df = pd.get_dummies(df.Name)
new_df.head()
24. How do you create a line plot in pandas?
To draw a line plot, we have a plot function in pandas.
import pandas as pd
data = {'units': [1, 2, 3, 4, 5],
'price': [7, 12, 8, 13, 16]}
# Create a DataFrame df
df = pd.DataFrame(data)
df.plot(x='units', y='price')
25. What is the pandas method to get the statistical summary of all the columns in a
DataFrame?
df.describe()
This method returns stats like mean, percentile values, min, max, etc., of each column in the
DataFrame.
26. What is Rolling mean?
Rolling mean is also referred to as moving average because the idea here is to compute the
mean of data points for a specified window and slide the window throughout the data. This will
lessen the fluctuations and highlight the long-term trends in time series data.


Syntax: df['column_name'].rolling(window=n).mean()
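Example (the values are chosen for illustration):
import pandas as pd
s = pd.Series([10, 12, 14, 16, 18, 20])
print(s.rolling(window=3).mean())   # NaN, NaN, 12.0, 14.0, 16.0, 18.0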
27. Characterize the DataFrames in Pandas?
A DataFrame is pandas' primary data structure: a two-dimensional, tabular structure with
labeled axes (rows and columns). A DataFrame is a typical way of storing data that has two
separate indices, namely a row index and a column index. It includes the following
characteristics:
Columns can be heterogeneous, e.g., of types such as int and bool.
It is commonly thought of as a dictionary of Series structures that share a common row index.
The column labels are denoted as "columns," and the row labels are denoted as "index."
Syntax: import pandas as pd
df = pd.DataFrame()
28. Explain how to create a Series from a dict in Pandas?
A Series is a one-dimensional labeled array that can hold any form of data (Python
objects, strings, integers, floating-point numbers, etc.). It's important to understand that,
unlike Python lists, a Series always contains the same type of data.
To make a pandas Series from a dictionary, pass the dictionary to the pd.Series() constructor.
If the index parameter is omitted, the dictionary keys become the index labels.
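Example (keys and values are illustrative):
import pandas as pd
marks = {'math': 85, 'science': 90, 'english': 78}
ser = pd.Series(marks)   # dict keys become the index
print(ser)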
29. Explain about the operations on Series in Pandas?
The pandas Series is a one-dimensional labeled array that may hold any type of data
(Python objects, strings, integers, floating-point numbers, etc.). The axis labels are
referred to as the index. A pandas Series is akin to a single column in an Excel spreadsheet.
Putting Together a Pandas Series -
In practice, a pandas Series is often built by loading a dataset from existing storage, which can
be a SQL database, a CSV file, or an Excel file. A Series can also be made from lists,
dictionaries, and other objects. A Series can be developed in a number of ways; here is one
example:
Creating a series from an array: To construct a series from an array, we must first import the
NumPy module and then use its array() function.
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['M', 'I', 'N', 'D', 'M', 'A', 'J', 'I', 'X'])
ser = pd.Series(data)
print(ser)
Output: a Series of nine single-character elements ('M', 'I', ..., 'X') with the default integer index 0-8.
30. Explain different ways of creating Data Frames in Pandas?
A data frame can be created in 3 different ways:
By making use of lists:
d = [['a', 2], ['b', 3], ['c', 4]]
Creating the Pandas Dataframe:
df = pd.DataFrame(d, columns=['Strings', 'Integer'])
print(df)
By using NumPy arrays:
pd.DataFrame() also accepts a two-dimensional NumPy array directly, e.g.
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B']).
By making use of a dictionary of lists:
All of the lists in a dictionary used to build a data frame must be the same length. If an index
is passed, its length must equal the length of the lists. If no index is specified, the index will
be range(n) by default, where n is the list length.
import pandas as pd
d = {'Name': ['XYZ', 'ABC', 'DEFC', 'ASWE'], 'marks': [85, 80, 75, 70]}


df = pd.DataFrame(d, index=['first', 'second', 'third', 'fourth'])


print(df)
31. Explain how to create empty DataFrames in Pandas?
To make a pandas DataFrame that is fully empty, perform the following:
import pandas as pd
MyEmptydf = pd.DataFrame()
This will result in a data frame that has no columns or rows.
We do the following to construct an empty dataframe with three empty columns (columns A,
B, and C):
df = pd.DataFrame(columns=['A', 'B', 'C'])
32. How will you add a column to the Existing Data Frames in Panda?
# Import pandas as a package
import pandas as pd
# Define a dictionary containing employee data
Employee = {'Emp_name': ['Ravi', 'Roshan', 'Vinod', 'Sailu'],
            'Emp_id': [123, 234, 145, 125],
            'Emp_qualification': ['Msc', 'BA', 'MBA', 'Msc']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(Employee)
# Declare a list that is to be converted into a column
Emp_address = ['Hyderabad', 'Delhi', 'Lucknow', 'Vijayawada']
# Using 'Emp_address' as the column name
# and equating it to the list
df['Emp_address'] = Emp_address
# Observe the result
df
Output:
  Emp_name  Emp_id Emp_qualification Emp_address
0     Ravi     123               Msc   Hyderabad
1   Roshan     234                BA       Delhi
2    Vinod     145               MBA     Lucknow
3    Sailu     125               Msc  Vijayawada

33. Tell us now how to retrieve a single column from a Panda Dataframe?
A single column can be retrieved with the indexing operator or with attribute access; both
return a pandas Series:
Syntax: df['column_name'] or df.column_name
For example, df['Age'] returns the 'Age' column of the DataFrame df as a Series.
34. Explain about Categorical Data in Pandas?
Categorical is a data type in Pandas that corresponds to categorical variables in statistics. A
categorical variable has a limited and usually fixed, set of possible values (categories; levels
in R). Gender, social class, blood type, national affiliation, observation time, or rating using
Likert scales are some examples. All values of a categorical column are either in the given
categories or np.nan.
In the following situations, categorical data is useful:
A string variable with a small number of distinct values. Transforming such a variable to
a category type can help you save memory.
A variable whose lexical order differs from its logical order ("one," "two," "three").
After converting to a categorical and providing an order on the categories, sorting and
min/max will use the logical order rather than the lexical order.


To indicate to other Python libraries that this column is a categorical variable (so that
appropriate statistical technique or plot types can be used).
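Example (a small sketch of the ordered-categories point above; the values are illustrative):
import pandas as pd
s = pd.Series(["low", "high", "medium", "low"], dtype="category")
s = s.cat.set_categories(["low", "medium", "high"], ordered=True)
print(s.min())   # 'low' -- uses the logical order, not the alphabetical one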
35. Explain about Multi Indexing in pandas?
Multi-indexing (hierarchical indexing) is considered fundamental indexing because it
simplifies data inspection and manipulation, especially when dealing with higher-dimensional
data. It allows us to store and handle data with an arbitrary number of dimensions in
lower-dimensional data structures like Series and DataFrames.
36. Explain about Pandas index?
Indexing in Pandas is the process of extracting specific rows and columns of data from a
DataFrame. Indexing could simply be selecting all of the rows and some of the columns, or
part of the rows and all of the columns, along with some of each row and column. Indexing is
often referred to as subset selection.
Pandas Indexing with [], .loc[], .iloc[], and .ix[]
There are numerous methods for obtaining the objects, elements, items, rows, and columns
from a data frame. In Pandas, some indexing methods can be used to retrieve an
object/element/item from a data frame. These indexing systems look extremely similar;
however, they behave significantly differently. Pandas supports four methods of multi-axis
indexing, which are as follows:
• DataFrame[]: This method is also known as the indexing operator.
• DataFrame.loc[]: This method is used for labels.
• DataFrame.iloc[]: This method is utilized for integer- or position-based indexing.
• DataFrame.ix[]: This method was utilized for both integer- and label-based indexing; it is
deprecated and was removed in pandas 1.0.
They are referred to collectively as indexers. All of those are, by far, the most popular
methods of indexing data. These four functions assist in retrieving the
object/elements/items, rows, and columns from a DataFrame.
37. Explain about Reindexing in Pandas?
Reindexing conforms the DataFrame to a new index with optional filling logic. It inserts
NA/NaN in the locations where elements were missing from the previous index. A new object
is produced unless the new index is equivalent to the current one and copy=False.
It is used to modify the index of the dataframe's rows and columns.
38. Can you explain multi-indexing columns in Pandas?
Because it enables richer data manipulation and analysis, multiple indexing is characterized as
vital indexing. This is certainly relevant when operating with high-dimensional data. It also
allows us to store and modify data with an arbitrary number of dimensions in
lower-dimensional data structures like DataFrames and Series.
Multiple Index Columns
Two columns will be used as index columns in this case. The drop parameter controls whether
those columns are removed from the body of the frame, whereas the append parameter appends
the given columns to an index column that already exists.
Example:
# importing pandas library
import pandas as pd
# Creating data
Information = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
               'Jobs': ["Software Developer", "System Engineer",
                        "Footballer", "Singer"],
               'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}
# Framing the whole data
df = pd.DataFrame(Information)
# Setting the two columns as a hierarchical index
df = df.set_index(['name', 'Jobs'], drop=True, append=False)
# Showing the above data
print(df)
Output: the DataFrame indexed by the ('name', 'Jobs') pairs, with 'Annual Salary(L.P.A)' as the only remaining column.


39. What is meant by set the index in Pandas?


Python is an excellent language for analyzing data, particularly with its vast ecosystem of
data-driven Python packages. Pandas is one of those packages, and it makes data import and
analysis considerably easier.
Pandas set_index() is a function for setting the index of a data frame using one or more
existing columns (or labels from a Series or list). The index column can also be set while
creating a data frame, but set_index() lets you change the index later.
Syntax:
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Parameters:
keys: The name of the column or a list of column names to use as the new index.
drop (default True): If True, remove the column(s) used as the new index from the frame's body.
append (default False): If True, the column is appended to the existing index instead of replacing it.
inplace (default False): If True, the changes are made in the data frame itself.
verify_integrity (default False): If True, check the new index column for duplicates.
Example:
# importing pandas library
import pandas as pd
# creating and initializing a nested list
students = [['jack', 34, 'Sydeny', 'Australia',85.96],
['Riti', 30, 'Delhi', 'India',95.20],
['Vansh', 31, 'Delhi', 'India',85.25],
['Nanyu', 32, 'Tokyo', 'Japan',74.21],
['Maychan', 16, 'New York', 'US',99.63],
['Mike', 17, 'las vegas', 'US',47.28]]
# Create a DataFrame object
df = pd.DataFrame(students,
columns=['Name', 'Age', 'City', 'Country','Agg_Marks'],
index=['a', 'b', 'c', 'd', 'e', 'f'])
# here we set Float column 'Agg_Marks' as index of data frame
# using dataframe.set_index() function
df = df.set_index('Agg_Marks')
# Displaying the Data frame
df
Output: the DataFrame with the float column 'Agg_Marks' as its index and Name, Age, City, Country as the remaining columns.


40. Explain how to reset the index in pandas?


A pandas Series is a one-dimensional ndarray with labels on the axis. The labels do not have
to be distinct, but they must be of a hashable type. The object allows both label-based and
integer indexing, as well as a set of techniques for handling the index.
The pandas function reset_index() creates a new Series or data frame with the
index reset. This is useful when an index must be utilized as an ordinary column.
Syntax:
reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
Parameters:
level: int, str, tuple, or list, default None
Remove only the specified levels from the index. By default, all levels are removed.
drop: bool, default False
If True, do not insert the old index as a column in the data frame; simply reset to the
default integer index.
inplace: bool, default False
Modify the existing DataFrame (do not create a new object).
col_level: int or str, default 0
This determines the level the labels are inserted into if the columns have several levels. They
are inserted into the first level by default.
col_fill: object, default ''
Determines how the other levels are named if the columns have multiple levels. If None, the
index name is repeated.
# Import pandas package
import pandas as pd
# Define a dictionary containing employee data
data = {'Name': ['Jai', 'Kanth', 'Vinod', 'Seeraj', 'Kokila'],
        'Age': [27, 26, 23, 30, 25],
        'Address': ['Delhi', 'Gujarat', 'Hyderabad', 'Vizag', 'Noida'],
        'Qualification': ['MCA', 'Ms', 'BA', 'Phd', 'MS']}
index = ['a', 'b', 'c', 'd', 'e']   # a list (not a set), so the label order is preserved
# Convert the dictionary into a DataFrame with the custom index
df = pd.DataFrame(data, index)
# Reset the index: the custom labels move into an 'index' column
# and a default integer index is created
df.reset_index(inplace=True)
df
Output: the DataFrame with a new default integer index 0-4 and the old labels 'a'-'e' stored in a column named 'index'.


41. Explain about Data operations in Pandas?


There are several useful data operations for DataFrame in Pandas, which are as follows:
-> Row and column selection:
We can retrieve any row or column of the DataFrame by specifying its name. A single row or
column selected from the DataFrame is one-dimensional and is regarded as a Series.
-> Filter Data:
You can filter the data by applying Boolean conditions to the DataFrame.
-> Null values:
A null value can appear when no data is provided for an item. There may be missing values in
the respective columns, which are commonly represented as NaN. Pandas provides several
useful functions for identifying, deleting, and changing null values in Data Frames. The
following are the functions:
• isnull(): returns True for entries that are null.
• notnull(): It is the inverse of the isnull() function, returning true values for non-null
values.
• dropna(): This function evaluates and removes null values from rows and columns.
• fillna(): It enables users to substitute other values for the NaN values.
• replace(): It's a powerful function that can take the role of a regex, dictionary,
string, series, and more.
• interpolate(): It's a useful function for filling null values in a series or data frame.
-> String Operation:
Pandas provide a set of string functions for working with string data while ignoring
missing/NaN values. The .str. option can be used to conduct various string operations. The
following are the functions:
• lower(): Any strings in the index or series are converted to lowercase letters.
• upper(): Any strings in the index or series are converted to uppercase letters.
• strip(): This method removes leading and trailing whitespace (including newlines) from
every string in the Series/Index.
• split(' '): It's a method that separates a string according to a pattern.
• cat(sep=' '): With a defined separator, it concatenates series/index items.
• contains(pattern): If a substring is available in the current element, it returns True;
otherwise, it returns False.
• replace(a,b): It substitutes the value b for the value a.
• repeat(value): Each element is repeated a defined multiple times.
• count(pattern): It returns the number of times a pattern appears in each element.
• startswith(pattern): If all of the components in the series begin with a pattern, it returns
True.
• endswith(pattern): If all of the components in the series terminate in a pattern, it returns
True.
• find(pattern): It can be used to return the pattern's first occurrence.
• findall(pattern): It gives you a list of all the times the pattern appears.
• swapcase: It is used to switch the lower/upper case.
• islower(): If all of the characters in the Series/Index string are lowercase, it returns True.
Otherwise, False is returned.


• isupper(): If all of the characters in the Series/Index string are uppercase, it returns True.
Otherwise, False is returned.
• isnumeric(): If all of the characters in the Series/Index string are numeric, it returns True.
Otherwise, False is returned.
-> Count Values:
Using the value_counts() method, this process counts the occurrences of each unique value
in a column.
42. How Can A Dataframe Be Converted To An Excel File?
Using the to_excel() function, we can export the data frame to an Excel file. We must mention
the target file name to write a single object to an Excel file. If we wish to write to multiple
sheets, we must build an ExcelWriter object with the target filename and specify the sheet in
the file that we want to write to.
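Example (a minimal sketch; it assumes an Excel engine such as openpyxl is installed, and the file and sheet names are illustrative):
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_excel("output.xlsx")                      # single sheet
with pd.ExcelWriter("report.xlsx") as writer:   # multiple sheets
    df.to_excel(writer, sheet_name="Sheet1")
    df.to_excel(writer, sheet_name="Sheet2")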
43. How To Format The Data in Your Pandas DataFrame?
Very often, you will want to perform operations on the actual values in your data frame.
Replacing All String Occurrences in a DataFrame:
The replace() method can be used to easily replace specific strings in your data frame.
Simply pass the values you want to replace, accompanied by the values you would like
to substitute them with.
It's worth noting that there's a regex argument that can come in handy when dealing with
unusual string combinations. In a nutshell, the replace() method is used when you wish to
substitute values or strings in your DataFrame with those from elsewhere.
Removing Parts From Strings in the Cells of Your DataFrame:
Removing unnecessary parts of strings is a time-consuming task. Fortunately, there is a remedy!
You apply a lambda function element-by-element to the column using map(). The function can,
for example, take the string value and strip unwanted characters from its left and right ends.
Splitting Text in a Column into Multiple Rows in a DataFrame:
Dividing the text in a column into multiple rows is trickier; the string split() method combined
with explode() can be used for this.
Applying A Function to Your Pandas DataFrame's Columns or Rows:
You might want to use a function to alter the information in the DataFrame; apply() runs a
function along each column (axis=0) or each row (axis=1).

44. Explain about Data Aggregation in Pandas?


To apply any aggregation method across one or more columns, use the DataFrame.aggregate()
method. Aggregations can be specified as strings, callables, dictionaries, or lists of
strings/callables. The following are the most common aggregations:
• Sum: This method returns the sum of the values for the requested column.
• Min: This method returns the minimum value for the requested column.
• Max: This method returns the maximum value for the requested column.
Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)
func: string, callable, or a list or dictionary of them; the function(s) to use for aggregating the
data. A function must either work when passed a DataFrame or be passable to DataFrame.apply.
If the keys are DataFrame column names, you can give a dict to aggregate different functions
per column.
axis (default 0): 0 or 'index' applies the function to each column; 1 or 'columns' applies the
function to each row.
Let us see an example for data aggregation:
# importing pandas package
import pandas as pd
# making data frame from csv file
df = pd.read_csv("nba.csv")
# printing the first 10 rows of the dataframe
df[:10]
# applying several aggregations at once (this assumes numeric
# columns 'Age' and 'Salary' exist in nba.csv)
print(df[['Age', 'Salary']].aggregate(['sum', 'min', 'max']))
45. What is the use of GroupBy Pandas?
The data is divided into groups using GroupBy. It organizes the data according to certain
parameters, mapping labels to group names. Its behavior can be varied widely through its
parameters, and it makes splitting data straightforward.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)
46. What is Pandas Numpy?
NumPy is an open-source Python package used to work with large numerical datasets, and
pandas is built on top of it. It includes a robust N-dimensional array object as well as
sophisticated mathematical functions for data processing with Python.
Fourier transforms, linear algebra, and pseudo-random number generation are among the
prominent features provided by NumPy. It also includes integrated tools for C/C++ and
Fortran code.
47. What is Vectorization in Python pandas?
Vectorization is the procedure of executing an operation on a full array at once rather than
element by element. It is intended to eliminate explicit Python-level loops. Pandas has a set of
vectorized methods, such as string functions and aggregations, that are optimized for use with
Series and DataFrames. As a result, it is preferable to use vectorized pandas methods to
perform tasks quickly.
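Example (a small sketch contrasting a Python loop with the vectorized equivalent; the data is synthetic):
import pandas as pd
s = pd.Series(range(1_000_000))
# slow: explicit Python-level loop
total = 0
for v in s:
    total += v * 2
# fast: vectorized, applied to the whole Series at once
total_vec = (s * 2).sum()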
48. How will you combine different Data Frames in Panda?
Following are the ways to combine different Data Frames in pandas:
-> append() method: This was used to vertically stack dataframes (row-wise); it was
deprecated and removed in pandas 2.0 in favor of concat().
Syntax: df1.append(df2)
-> concat() method: This is used to stack data frames along rows (or along columns with
axis=1). This works best when the data frames have the same fields and columns.
Syntax: pd.concat([df1, df2])
-> join() method: This is used to combine data from different dataframes that share an index
or one or more common columns.
Syntax: df1.join(df2)
49. How can you iterate over the Data frame in Pandas?
You can iterate over the rows of a DataFrame by combining a for loop with an iterrows() call, which yields (index, row) pairs.
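Example (a minimal sketch):
import pandas as pd
df = pd.DataFrame({'name': ['John', 'Anna'], 'age': [28, 24]})
for idx, row in df.iterrows():
    print(idx, row['name'], row['age'])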
50. What Method Will You Use To Rename The Index Or Columns Of Pandas Dataframe?
The .rename() method is used to rename DataFrame index values or column labels.
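Example (a minimal sketch):
import pandas as pd
df = pd.DataFrame({'a': [1], 'b': [2]})
df = df.rename(columns={'a': 'alpha'}, index={0: 'row0'})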

PART B (5 MARKS)


1. How do you load a CSV file named “data.csv” into a pandas DataFrame named “df”?

To load a CSV file into a pandas DataFrame, you use the pd.read_csv() function. This
function reads a CSV file and returns a DataFrame, which is a 2-dimensional labeled data
structure (like a table). Here’s a breakdown of how it works:

Step 1: Import pandas.

Step 2: Use pd.read_csv() to read the file.

Syntax:


df = pd.read_csv('path_to_file.csv')

'path_to_file.csv' can be a local file path or a URL if the file is hosted online.

Explanation:

The read_csv() function reads the data from the file and automatically infers the columns and
the data types.

It loads the CSV data into a pandas DataFrame, where each row in the CSV file becomes a
row in the DataFrame, and each column in the CSV becomes a column in the DataFrame.

Additional Parameters:

delimiter: You can specify the delimiter if it's not a comma (e.g., \t for tab-delimited files).

header: If your CSV file doesn't have a header row, pass header=None so the first line is
treated as data rather than column names.

names: If you want to specify column names manually, use the names parameter.

Example:


import pandas as pd

# Load CSV file


df = pd.read_csv("data.csv")

# Display the first 5 rows to check


print(df.head())
In this example:

We load the CSV data from the file named data.csv.


The df.head() function displays the first 5 rows of the DataFrame, giving us a quick overview
of the data.

If the file path is incorrect or the file is not found, Python will raise a FileNotFoundError. In
such cases, make sure to verify the file's location or use the absolute path.

2. How can you display the last 10 rows of a DataFrame named “df”?
In pandas, to display the last few rows of a DataFrame, you can use the .tail() method. By
default, .tail() will return the last 5 rows, but you can specify any number of rows you wish to
display by passing an integer as an argument.

Syntax:

df.tail(n)
n is the number of rows you want to display from the end of the DataFrame. If no value is
provided, it defaults to 5.

Explanation:
.tail(n) retrieves the last n rows of the DataFrame, which is particularly useful when you want
to inspect the data from the bottom of a large dataset.

This method doesn't alter the DataFrame itself but simply returns a view of the last n rows.

Example:

import pandas as pd

# Example DataFrame
data = {'name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Sarah'],
'age': [28, 24, 35, 32, 45, 29],
'city': ['New York', 'Los Angeles', 'Chicago', 'Miami', 'Boston', 'San Francisco']}

df = pd.DataFrame(data)

# Display last 10 rows


print(df.tail(10))
In this example, since the DataFrame df has only 6 rows, calling df.tail(10) will return all 6
rows.

Important Points:
If you specify a value of n greater than the number of rows in the DataFrame, the method will
return the entire DataFrame.

.tail() is often used to check the last few records in time-series data or any dataset that is
ordered chronologically.

Use Case:
Imagine you have a DataFrame with sales data for multiple years, and you want to inspect the
most recent entries. Using .tail() helps you easily focus on the last portion of the data.

3. Explain the difference between loc[] and iloc[] in pandas.


Both loc[] and iloc[] are used to access data in a pandas DataFrame, but they are used in
different contexts based on the type of indexing you need.
loc[] – Label-based indexing

loc[] is used when you want to access data based on labels (index labels or column names).

The loc[] function allows you to specify the rows and columns by their labels, which are the
names of the rows or columns in the DataFrame.

Inclusive of the start and end points: When you slice with loc[], both the start and end
labels are included.

Syntax:

df.loc[row_label, column_label]

row_label: The label of the row you want to access.

column_label: The label of the column you want to access.

iloc[] – Integer-based indexing

iloc[] is used when you want to access data based on integer positions (index positions or
column positions).

The iloc[] function allows you to specify the rows and columns by their integer positions.

Exclusive of the end position: When you slice with iloc[], the end index is not included,
similar to Python's standard slicing behavior.

Syntax:

df.iloc[row_position, column_position]

row_position: The integer position of the row (0-indexed).

column_position: The integer position of the column (0-indexed).

Key Differences:
1. Type of Indexing:
o loc[] uses labels (row/column names).
o iloc[] uses integer positions (0-based indexing).
2. Slicing:
o With loc[], when slicing, both the start and the end labels are included.
o With iloc[], when slicing, the end index is excluded.
Examples:
Example using loc[]:

import pandas as pd


# Sample DataFrame
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
'age': [28, 24, 35, 32],
'city': ['New York', 'Los Angeles', 'Chicago', 'Miami']}

df = pd.DataFrame(data)
df.set_index('name', inplace=True)

# Accessing using labels with loc[]


print(df.loc['Anna', 'age']) # Output: 24
print(df.loc['John':'Linda', 'city']) # Rows from 'John' to 'Linda' and the 'city' column
In this example:

We access the data using the label Anna and the column name age.

The slice df.loc['John':'Linda', 'city'] will return all rows between 'John' and 'Linda' (inclusive)
and the 'city' column.

Example using iloc[]:


# Accessing using integer positions with iloc[]

print(df.iloc[1, 0]) # Output: 24 (row at position 1, column at position 0: 'Anna', 'age')

print(df.iloc[0:3, 1]) # Rows 0 to 2 and the column at position 1 ('city')

In this example:

We access the data by position: df.iloc[1, 0] accesses the second row and first column
(remember, 0-based indexing; since 'name' is the index, 'age' is the column at position 0).

The slice df.iloc[0:3, 1] returns rows from position 0 to position 2 (excluding position 3) and
the column at position 1, which is 'city'.

Important Notes:

When using loc[], the row/column names must exist in the DataFrame. If you try to access a
label that doesn't exist, pandas will raise a KeyError.

When using iloc[], you need to ensure that the integer positions you provide are within the
bounds of the DataFrame. If you try to access an index position that doesn’t exist, pandas will
raise an IndexError.

4. Write code to filter rows in a DataFrame where the value in the “age” column is greater
than 30.


In pandas, you can filter rows using Boolean indexing. To filter rows where the value in a
specific column meets a condition, you can use a condition inside the square brackets.
Syntax:
df[df['column_name'] > value]   # or any other Boolean condition
Explanation:
df['age'] > 30 creates a Boolean series (True/False values) where each value is True if the
condition is met (i.e., age is greater than 30) and False otherwise.
This Boolean series is then used to filter the rows.
Example:

import pandas as pd

# Sample DataFrame
data = {'name': ['John', 'Anna', 'Peter', 'Linda'],
'age': [28, 24, 35, 32],
'city': ['New York', 'Los Angeles', 'Chicago', 'Miami']}

df = pd.DataFrame(data)

# Filter rows where 'age' > 30


filtered_df = df[df['age'] > 30]

# Display filtered DataFrame


print(filtered_df)
Output:
    name  age     city
2  Peter   35  Chicago
3  Linda   32    Miami
In this example:
The condition df['age'] > 30 filters out all rows where the "age" column has values greater
than 30.
The result is a new DataFrame containing only the rows that satisfy this condition.

5. How do you drop all rows with missing values in a DataFrame?

To drop rows with missing values (i.e., NaN), you can use the .dropna() method in pandas.
Syntax:

df.dropna()
Explanation:
.dropna() will return a new DataFrame with all rows containing any missing values removed.
By default, it drops rows where any column contains missing values.
It does not modify the original DataFrame unless you set inplace=True.
Example:

import pandas as pd


# Sample DataFrame with NaN values


data = {'name': ['John', 'Anna', 'Peter', None],
'age': [28, None, 35, 32],
'city': ['New York', 'Los Angeles', None, 'Miami']}

df = pd.DataFrame(data)

# Drop rows with missing values


cleaned_df = df.dropna()

# Display cleaned DataFrame


print(cleaned_df)
Output:
   name   age      city
0  John  28.0  New York
In this example:
The dropna() method removes rows where any column has a missing value (NaN).
Only the first row remains because it has no missing values.
You can also use additional parameters to customize how you drop rows:
axis=0: Drops rows (default behavior).
axis=1: Drops columns with missing values.
how='all': Only drops rows/columns if all values are missing.

6. Write code to calculate the sum of the “sales” column in a DataFrame.


To calculate the sum of a specific column in a DataFrame, you can use the .sum() method.
Syntax:

df['column_name'].sum()
Explanation:
.sum() adds up the values in the specified column.
The column must contain numerical values (integers, floats) to sum them.
Example:

import pandas as pd

# Sample DataFrame
data = {'product': ['A', 'B', 'C', 'D'],
'sales': [150, 200, 175, 225]}

df = pd.DataFrame(data)

# Calculate the sum of the 'sales' column


total_sales = df['sales'].sum()

# Display the total sales


print("Total Sales:", total_sales)
Output:

Total Sales: 750


In this example:
The .sum() method adds up all values in the 'sales' column and returns the result, which is
750.

7. Explain how the groupby() function works in pandas and provide an example.
The groupby() function in pandas is used to group data based on one or more columns and then
apply a function (e.g., sum, mean, count) to each group. It is similar to the "GROUP BY" operation in
SQL.
Syntax:

df.groupby('column_name').agg_function()
Explanation:
The groupby() function splits the DataFrame into groups based on the unique values of the
specified column(s).
After grouping, you can apply an aggregate function like sum(), mean(), or count() to each
group.
The result is a new DataFrame with the aggregated values.
Example:

import pandas as pd

# Sample DataFrame
data = {'category': ['A', 'B', 'A', 'B', 'A'],
'sales': [150, 200, 175, 225, 100]}

df = pd.DataFrame(data)

# Group by 'category' and sum the 'sales' within each group


grouped = df.groupby('category')['sales'].sum()

# Display grouped data


print(grouped)
Output:

category
A 425
B 425
Name: sales, dtype: int64
In this example:
The DataFrame is grouped by the 'category' column.
The sum() function is applied to the 'sales' column within each group, resulting in the total
sales for each category.

8. Describe the difference between merge() and join() functions in pandas.


Both merge() and join() functions are used to combine two DataFrames, but they are used in slightly
different ways.
merge()
Used to combine two DataFrames based on one or more columns.
It is more flexible and allows various types of joins, similar to SQL joins (inner, outer, left,
right).


It can join on any columns (not just the index).


join()
Used to join two DataFrames based on their index or optionally a column if specified.
It is simpler than merge(), typically used when you're working with DataFrames that have the
same index.
It is more efficient when you want to join on indices rather than columns.
Example of merge():

# Merge DataFrames on 'id' column


merged_df = pd.merge(df1, df2, on='id', how='inner')
Example of join():

# Join DataFrames on index


joined_df = df1.join(df2)

9. How can you pivot a DataFrame using the pivot() function?


The pivot() function is used to reshape data by turning unique values in a column into separate
columns, essentially transforming a DataFrame from long to wide format.
Syntax:

df.pivot(index='row_column', columns='column_column', values='value_column')


Explanation:
index: The column to use as the new index (rows).
columns: The column to use as the new columns.
values: The column to use for the values in the pivoted table.
Example:

import pandas as pd

# Sample DataFrame
data = {'date': ['2021-01-01', '2021-01-02', '2021-01-03'],
'product': ['A', 'B', 'A'],
'sales': [100, 150, 200]}

df = pd.DataFrame(data)

# Pivot the DataFrame


pivoted_df = df.pivot(index='date', columns='product', values='sales')

# Display pivoted DataFrame


print(pivoted_df)
Output:

product A B
date
2021-01-01 100.0 NaN
2021-01-02 NaN 150.0
2021-01-03 200.0 NaN


10. Write code to sort a DataFrame by the “price” column in descending order.
To sort a DataFrame by a specific column, you can use the .sort_values() method.
Syntax:

df.sort_values(by='column_name', ascending=False)
Explanation:
The by parameter specifies the column(s) by which to sort.
The ascending=False parameter sorts the values in descending order.
Example:

import pandas as pd

# Sample DataFrame
data = {'product': ['A', 'B', 'C', 'D'],
'price': [150, 200, 175, 225]}

df = pd.DataFrame(data)

# Sort by 'price' in descending order


sorted_df = df.sort_values(by='price', ascending=False)

# Display sorted DataFrame


print(sorted_df)
Output:

product price
3 D 225
1 B 200
2 C 175
0 A 150

PART C (10 MARKS)

1. Describe different methods for loading data into a panda Data Frame, and discuss the
advantages and disadvantages of each method.

Methods for Loading Data into a panda Data Frame:

1. read_csv():

o Description: This is the most common method for loading data stored in a CSV file
into a DataFrame. It reads data line by line and converts it into a tabular format.

o Advantages:

▪ Fast for loading small to medium-sized datasets.

▪ Supports many parameters to handle different kinds of CSV files (e.g., different delimiters, handling missing values).

▪ Highly flexible, allows users to specify types, skip rows, or parse dates.


o Disadvantages:

▪ Performance degrades with large files (memory issues).

▪ Does not handle other formats, such as Excel or JSON.

▪ Parsing very large CSV files can take significant time and memory.

Example:

import pandas as pd

df = pd.read_csv("data.csv", delimiter=',')

2. read_excel():

o Description: This method is used to read Excel files (.xls, .xlsx) into pandas
DataFrame.

o Advantages:

▪ Supports reading multiple sheets (via sheet_name parameter).

▪ Useful when dealing with data in Excel files with complex formatting.

▪ Can handle files with mixed data types in different columns.

o Disadvantages:

▪ Slower than read_csv() for large files.

▪ Requires the installation of libraries like openpyxl or xlrd to read .xlsx or .xls
files.

▪ Not suitable for very large files due to memory limitations.

Example:

df = pd.read_excel("data.xlsx", sheet_name='Sheet1')

3. read_sql():

o Description: Loads data directly from a SQL database (e.g., MySQL, PostgreSQL)
into a pandas DataFrame by executing a SQL query.

o Advantages:

▪ Direct integration with databases, making it easier to work with large datasets
that reside in databases.

▪ Allows complex SQL queries and joins directly within pandas.

▪ Supports reading data in chunks to handle very large datasets.


o Disadvantages:

▪ Requires a live database connection.

▪ Performance may be slower than loading from a flat file, especially if the
query is complex.

▪ Limited to relational databases.

Example:
import sqlite3

connection = sqlite3.connect("database.db")

df = pd.read_sql("SELECT * FROM table_name", connection)

4. read_json():

o Description: Reads data from a JSON file into a pandas DataFrame. JSON is a
commonly used format for web APIs and nested data structures.

o Advantages:

▪ Excellent for handling data stored in JSON format (common in web applications and APIs).

▪ Can handle nested JSON data by flattening it into columns.

o Disadvantages:

▪ Performance may suffer with large or deeply nested JSON files.

▪ Not ideal for reading very large datasets or datasets with deeply nested
structures without preprocessing.

Example:

df = pd.read_json("data.json")

Conclusion:

• The choice of method depends on the format of the data you're working with (CSV, Excel,
JSON, SQL) and the size of the dataset. read_csv() is most commonly used, but other
methods like read_sql() and read_excel() provide better options for specific use cases like
databases and Excel files.

2. Explain the importance of indexing and slicing in pandas, and provide examples of both using
loc[] and iloc[].

Indexing and Slicing in pandas:


• Indexing in pandas refers to selecting data based on its row or column labels, while slicing
refers to selecting subsets of rows or columns based on a given range or condition.

• Importance:

o Data Selection: Indexing allows quick access to rows or columns.

o Efficiency: Proper use of indexing and slicing enables efficient data manipulation,
especially when working with large datasets.

o Flexibility: Allows for label-based (using row/column names) and position-based (using integer indices) selection.

Using loc[] (Label-based Indexing):

• loc[] allows for label-based selection. It can be used for row and column selection, both
using labels (names of rows and columns).

• Inclusive: When slicing with loc[], both the starting and ending points are included.

Example:

import pandas as pd

data = {'name': ['John', 'Anna', 'Peter', 'Linda'],

'age': [28, 24, 35, 32],

'city': ['New York', 'Los Angeles', 'Chicago', 'Miami']}

df = pd.DataFrame(data)

# Accessing row by label

print(df.loc[1]) # Access row with index label '1'

print(df.loc[1, 'name']) # Access 'name' column of row '1'

# Slicing rows by label

print(df.loc[1:3, ['name', 'age']]) # Rows from index 1 to 3 and columns 'name' and 'age'

Advantages of loc[]:

• Allows both row and column selection based on labels.

• Supports slicing based on labels (inclusive of the end label).

Using iloc[] (Integer-based Indexing):

• iloc[] uses integer positions to select data. It is 0-indexed, meaning the first row has an index
of 0.

• Exclusive: When slicing with iloc[], the end index is excluded.


Example:

# Accessing by index position

print(df.iloc[0]) # First row (index 0)

print(df.iloc[0, 1]) # First row, second column (index 1)

# Slicing rows by position

print(df.iloc[1:3, 0:2]) # Rows 1 to 2, columns 0 to 1

Advantages of iloc[]:

• Ideal when you need to access rows or columns by their position rather than their label.

• Supports advanced integer-based slicing.

Conclusion:

• loc[] is useful when you need to access data using row/column labels.

• iloc[] is useful when working with integer positions (especially useful for iterating over data
or when row/column labels are unknown or irrelevant).

3. Discuss common techniques for cleaning data in pandas, including handling missing values,
removing duplicates, and dealing with outliers. Provide code examples for each.

Data cleaning is one of the most critical steps in any data analysis pipeline, as it ensures the accuracy
and consistency of the dataset.

1. Handling Missing Values:

• Techniques:

o dropna(): Removes rows or columns with missing values.

o fillna(): Fills missing values with a specified value or method (e.g., forward fill or
backward fill).

Example:

# Dropping rows with any NaN values

df_cleaned = df.dropna()

# Filling missing values with a specific value

df_filled = df.fillna({'age': 30, 'city': 'Unknown'})

# Forward filling missing values (fillna(method='ffill') is deprecated in recent pandas versions)

df_forward_filled = df.ffill()


2. Removing Duplicates:

• Technique: drop_duplicates() removes duplicate rows based on all or specific columns.

Example:

# Removing duplicate rows

df_no_duplicates = df.drop_duplicates()

# Removing duplicates based on specific columns

df_no_duplicates_columns = df.drop_duplicates(subset=['name'])

3. Dealing with Outliers:

• Techniques:

o Z-score method: Identifies outliers by calculating the standard score (Z-score) for
each data point.

o IQR method: Uses the Interquartile Range (IQR) to find outliers by identifying
points that fall outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

Example (using Z-score):

from scipy import stats

import numpy as np

# Z-score method to identify outliers

z_scores = np.abs(stats.zscore(df['age']))

df_no_outliers = df[z_scores < 3] # Removing rows where Z-score > 3
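Example (using IQR, a minimal sketch of the rule described above):

# IQR method to identify outliers in the 'age' column
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df_no_outliers_iqr = df[(df['age'] >= lower) & (df['age'] <= upper)]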

Conclusion:

• Data cleaning techniques, including handling missing values, removing duplicates, and
dealing with outliers, are essential steps in preparing data for analysis. Using methods like
fillna(), drop_duplicates(), and outlier detection methods helps ensure that the data is reliable
and free of errors.

4. Describe various data transformation techniques in pandas, such as merging, joining,
reshaping, and pivoting. Provide examples for each transformation.

Data transformation techniques in pandas allow you to manipulate and structure data in various forms
for analysis.

1. Merging DataFrames:

• merge() is used to combine DataFrames based on one or more columns. It works similarly to
SQL joins (e.g., inner, outer, left, right).

Example:



df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})

df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [28, 24, 30]})

# Merging on 'id' column

df_merged = pd.merge(df1, df2, on='id', how='inner')

2. Joining DataFrames:

• join() is used to join DataFrames on their index (by default) or on a specified column.

Example:

df1 = pd.DataFrame({'name': ['A', 'B', 'C']}, index=[1, 2, 3])

df2 = pd.DataFrame({'age': [28, 24, 35]}, index=[1, 2, 3])

# Joining df1 and df2 by index

df_joined = df1.join(df2)

3. Reshaping Data (Melt and Pivot):

• melt() transforms the DataFrame from wide to long format.

• pivot() reshapes data by turning unique column values into separate columns.

Example of melt():

df = pd.DataFrame({'id': [1, 2], 'var1': [10, 20], 'var2': [30, 40]})  # illustrative values

# Wide to long: one row per (id, variable) pair
df_melted = pd.melt(df, id_vars=['id'], value_vars=['var1', 'var2'])
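Example of pivot() (a minimal sketch that reshapes the melted DataFrame above back to wide format; by default, melt() names its output columns 'variable' and 'value'):

# Long to wide: unique values of 'variable' become separate columns
df_pivoted = df_melted.pivot(index='id', columns='variable', values='value')

Conclusion:

• merge() and join() combine related DataFrames, while melt() and pivot() reshape data between long and wide formats, giving flexible control over how a dataset is structured for analysis.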

5. Explain how the groupby() function works in pandas and how it can be used for data
aggregation. Provide examples of grouping data and applying aggregate functions.

groupby() in pandas:

• groupby() is used to split the data into groups based on some criteria and apply functions to
each group independently.

• It can be used to aggregate data (e.g., sum, mean), transform, or filter groups.

Steps:


1. Split: The data is split into groups based on specified column(s).

2. Apply: An aggregation or transformation is applied to each group.

3. Combine: The results are combined back into a DataFrame.

Aggregation Functions:

• Common aggregation functions include sum(), mean(), count(), min(), max(), etc.

Example:


import pandas as pd

data = {'category': ['A', 'A', 'B', 'B', 'A'],

'sales': [100, 200, 150, 250, 300],

'profit': [20, 40, 30, 50, 60]}

df = pd.DataFrame(data)

# Grouping by category and calculating the sum of sales and profit

grouped = df.groupby('category').agg({'sales': 'sum', 'profit': 'sum'})

print(grouped)

Output:

sales profit

category

A 600 120

B 400 80

• Here, the data is grouped by the category column, and the sum of sales and profit is
calculated for each group.
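• groupby() also supports the transform and filter operations mentioned above; a minimal sketch using the same df:

# transform() returns a result aligned with the original rows
df['category_total'] = df.groupby('category')['sales'].transform('sum')

# filter() keeps only the groups that satisfy a condition
high_sales = df.groupby('category').filter(lambda g: g['sales'].sum() > 500)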

Conclusion:

• The groupby() function is essential for splitting data into groups, performing operations on
each group, and then combining the results. It is especially useful in aggregating large
datasets.

6. Discuss pandas’ capabilities for time series analysis, including resampling, rolling windows,
and date/time indexing. Provide examples of each.

Pandas provides robust support for time series analysis. It allows handling time-related data and
performing various operations like resampling, rolling windows, and date/time indexing.


1. Date/Time Indexing:

• In pandas, you can set a date column as an index to perform time-based indexing and
selection.

Example:

# Creating a time series

dates = pd.date_range('2023-01-01', periods=6, freq='D')

df = pd.DataFrame({'date': dates, 'data': [1, 2, 3, 4, 5, 6]})

df.set_index('date', inplace=True)

print(df)

Output:

            data
date
2023-01-01     1
2023-01-02     2
2023-01-03     3
2023-01-04     4
2023-01-05     5
2023-01-06     6
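• Once the DataFrame has a DatetimeIndex, rows can also be selected directly with date strings (a small sketch):

print(df.loc['2023-01-03'])               # a single day
print(df.loc['2023-01-02':'2023-01-04'])  # an inclusive date range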

2. Resampling:

• Resampling is used to change the frequency of time series data (e.g., converting daily data to
monthly data).

Example:


# Resampling the data to a weekly frequency (sum the data within each week)

df_resampled = df.resample('W').sum()

print(df_resampled)

Output:

            data
date
2023-01-01     1
2023-01-08    20

• resample() allows changing the frequency and applying aggregation functions like sum,
mean, etc.

3. Rolling Windows:

• Rolling windows provide a way to apply functions over a sliding window of data.

Example:


# Applying a rolling window of size 3 and calculating the mean

df['rolling_mean'] = df['data'].rolling(window=3).mean()

print(df)

Output:


data rolling_mean

date

2023-01-01 1 NaN

2023-01-02 2 NaN

2023-01-03 3 2.000000

2023-01-04 4 3.000000

2023-01-05 5 4.000000

2023-01-06 6 5.000000

• rolling() can be used for moving averages, sums, and other rolling operations.

Conclusion:

• Pandas provides powerful tools for working with time series data, including date indexing,
resampling, and rolling windows, making it easier to analyze temporal patterns and trends.


7. Describe how pandas integrates with popular visualization libraries like Matplotlib and
Seaborn for data visualization. Provide examples of creating various types of plots using
pandas.

Pandas integrates seamlessly with Matplotlib and Seaborn, two of the most popular data
visualization libraries, enabling you to easily visualize data from DataFrames.

1. Plotting with Matplotlib:

• You can use the plot() function in pandas, which is built on top of Matplotlib to create
various plots (e.g., line plots, bar plots).

Example (Line Plot):


import matplotlib.pyplot as plt

# Creating a simple line plot

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 4, 9, 16, 25]})

df.plot(x='x', y='y', kind='line', title='Line Plot')

plt.show()

2. Plotting with Seaborn:

• Seaborn provides high-level interfaces for drawing attractive and informative statistical
graphics.

• It works directly with pandas DataFrames and simplifies visualizations like box plots,
histograms, and heatmaps.

Example (Box Plot):

import seaborn as sns

# A small DataFrame with one categorical and one numeric column (illustrative)
df = pd.DataFrame({'category': ['A', 'A', 'B', 'B'], 'data': [1, 3, 2, 5]})

# Creating a box plot
sns.boxplot(x='category', y='data', data=df)

plt.show()

3. Histograms and Bar Plots:

• Histograms help visualize the distribution of numerical data.

• Bar plots are useful for comparing values across categories.

Example (Bar Plot):



df = pd.DataFrame({'category': ['A', 'B', 'C', 'D'], 'value': [23, 45, 56, 78]})

df.plot(kind='bar', x='category', y='value', title='Bar Plot')

plt.show()
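Example (Histogram, a minimal sketch reusing the df above, whose 'value' column is numeric):

df['value'].plot(kind='hist', bins=5, title='Histogram')
plt.show()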

Conclusion:

• Pandas provides built-in support for Matplotlib and Seaborn, making it easy to generate a
variety of plots directly from DataFrames. You can create line plots, bar charts, box plots, and
more with minimal code, aiding in effective data visualization.

8. Discuss techniques for optimizing the performance of pandas operations, such as using
vectorized operations, avoiding loops, and utilizing specialized functions. Provide examples to
illustrate each technique.

1. Vectorized Operations:

• Vectorization allows for applying operations on entire arrays (or Series) at once, rather than
using loops, which significantly improves performance.

Example:

# Without vectorization (slow): a Python-level loop over every value

df['data'] = [x + 1 for x in df['data']]

# With vectorization (fast): the addition is applied to the whole column at once

df['data'] = df['data'] + 1
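A quick way to see the difference is to time both approaches on a larger Series (a self-contained sketch; exact timings vary by machine):

import time
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

start = time.perf_counter()
_ = [x + 1 for x in s]   # Python-level loop over every element
loop_time = time.perf_counter() - start

start = time.perf_counter()
_ = s + 1                # vectorized: one call into optimized C code
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, vectorized: {vec_time:.3f}s")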

2. Avoiding Loops:

• For loops are slow in pandas. It is better to use pandas’ built-in functions (which are
optimized for speed) to avoid explicit loops.

Example:

# apply() avoids an explicit for loop, though it is still not truly vectorized

df['new_column'] = df['data'].apply(lambda x: x * 2)

# A truly vectorized equivalent is faster still

df['new_column'] = df['data'] * 2

3. Specialized Functions:

• pandas provides optimized functions for specific tasks, such as cumsum(), sum(), and
mean(), which are faster than writing custom loops.


Example:


# Using specialized function for cumulative sum (faster than looping)

df['cumsum'] = df['data'].cumsum()

Conclusion:

• Optimizing performance in pandas involves using vectorized operations and specialized


functions, which are much faster than traditional for-loops. These techniques are essential
when working with large datasets to ensure scalability and efficiency.

9. Explain how pandas handles categorical data and discuss the benefits of encoding categorical
variables. Provide examples of encoding categorical variables using pandas.

Handling Categorical Data in pandas:

• Pandas has a Categorical data type, which allows you to represent data with a fixed number
of possible values, thus saving memory and improving performance when working with
repetitive text or categorical data.

Benefits of Categorical Data:

• Memory Efficient: Categoricals use less memory than object types.

• Faster Operations: Operations on categorical data are faster than on regular object data
types.

• Sorting: Categories can be ordered, which is useful in some statistical or plotting


applications.

Encoding Categorical Variables:

• astype('category'): Converts a column into categorical data type.

• pd.get_dummies(): Converts categorical variables into a set of binary variables
(dummy/one-hot encoding).

Example:


# Converting to categorical type

df['category'] = df['category'].astype('category')

# One-hot encoding

df_dummies = pd.get_dummies(df, columns=['category'])
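The memory saving can be verified directly (a minimal sketch; exact numbers depend on the data):

s_obj = pd.Series(['A', 'B', 'A', 'B'] * 250_000)
s_cat = s_obj.astype('category')

print(s_obj.memory_usage(deep=True))  # object dtype stores every string separately
print(s_cat.memory_usage(deep=True))  # category dtype stores small integer codes plus a few categories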


Conclusion:

By using categorical data types and techniques like one-hot encoding, pandas helps you
manage categorical variables efficiently, reducing memory usage and speeding up processing.

10. Discuss common errors and exceptions encountered when working with pandas, and discuss
strategies for handling these errors gracefully. Provide examples of error handling techniques in
pandas.

Common Errors and Exceptions:

1. KeyError: Occurs when trying to access a column or row that does not exist.

2. ValueError: Happens when the data shape or type is incompatible with an operation.

3. TypeError: When performing operations with incompatible data types.

Error Handling Strategies:

1. Try-Except Blocks: Catch and handle exceptions using try-except to prevent crashes.

Example:


try:
    df['non_existent_column']
except KeyError:
    print("The specified column does not exist.")

2. get() Method: For safer dictionary-like access in pandas Series/DataFrames.

Example:


# Safer way to access a column: returns the fallback if the column is missing

df.get('non_existent_column', default_value)  # default_value is a placeholder for any fallback you choose

3. Using .isna(): To handle missing values in a controlled way.

Example:


df['column_name'].isna().sum() # Check how many missing values in a column
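A common follow-up is to fill the missing values only when some are actually present (a small sketch; the fill value 0 is illustrative):

if df['column_name'].isna().any():
    df['column_name'] = df['column_name'].fillna(0)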

Conclusion:


• Error handling is essential in data analysis workflows. By using try-except blocks, pandas-
specific methods like .get(), and handling missing values with .isna(), you can ensure that
your code is robust and can handle common issues without crashing.

