KPMG Interview

The document discusses various aspects of handling data in Apache Spark, including dictionary manipulation, data masking, SQL queries for course completion statistics, partitioning behavior, managed identities, and schema management for CSV files. It highlights the importance of understanding how Spark processes data, the implications of schema mismatches, and the methods for creating shallow and deep copies of data structures. Additionally, it provides insights into optimizing data reading and processing in Spark, including the impact of configuration settings on partitioning and performance.

What is the output of the code below?

python

emp_detail_dict = {"emp_name": "Gokul",
                   "emp_id": "KGS123",
                   "emp_contact": "9876543210",
                   "emp_location": "Bangalore"}

emp_detail_dict_clone = emp_detail_dict

emp_detail_dict["emp_contact"] = "0123456789"
emp_detail_dict["emp_location"] = "Kochin"

print(emp_detail_dict_clone)

Assignment: emp_detail_dict_clone = emp_detail_dict does not create a copy but rather assigns
emp_detail_dict_clone to the same dictionary object as emp_detail_dict. Both variables now point
to the same dictionary in memory.

Modification: When you update emp_detail_dict, you are modifying the original dictionary object.
Since emp_detail_dict_clone references the same object, these changes are reflected when you
access emp_detail_dict_clone.

Output:

{'emp_name': 'Gokul', 'emp_id': 'KGS123', 'emp_contact': '0123456789', 'emp_location': 'Kochin'}
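A quick way to confirm the aliasing (a minimal check, not part of the original question):

python

# Both names point to the same dictionary object in memory
print(emp_detail_dict_clone is emp_detail_dict)  # True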

If instead a shallow copy is made:

python

import copy

# Creating a shallow copy
emp_detail_dict_clone = copy.copy(emp_detail_dict)

In this case, the later changes to emp_detail_dict are not reflected in emp_detail_dict_clone, because copy.copy() creates a new top-level dictionary object (although any nested objects are still shared by reference).

And if a deep copy is made:

python

# Creating a deep copy
emp_detail_dict_clone = copy.deepcopy(emp_detail_dict)

Here, copy.deepcopy() creates a completely independent copy of the dictionary, including any nested objects, so changes to the original dictionary do not affect the deep copy.
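To see where shallow and deep copies actually behave differently, consider a dictionary with a nested list (a minimal sketch; the nested emp_skills field is hypothetical and not part of the original example):

python

import copy

emp = {"emp_name": "Gokul", "emp_skills": ["Spark", "SQL"]}  # hypothetical nested field

shallow = copy.copy(emp)      # new outer dict, but the inner list is shared
deep = copy.deepcopy(emp)     # new outer dict and a new, independent inner list

emp["emp_skills"].append("Python")  # mutate the nested object in place

print(shallow["emp_skills"])  # ['Spark', 'SQL', 'Python'] -- shallow copy sees the change
print(deep["emp_skills"])     # ['Spark', 'SQL'] -- deep copy is unaffected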

Data Masking: only the last 3 digits of the phone number should be visible; everything else should be masked with 'X'.

+------+------+-------------+
|emp_id|  name|        phone|
+------+------+-------------+
|     1| Gokul|+919098903987|
|     2|Deepak|+917898920928|
|    12| Sunil|+918987678923|
|    34|  Atul|  +1233909238|
|    10| Pawan|+443290390924|
+------+------+-------------+

python

from pyspark.sql.functions import col, concat, substring, expr

# Keep the last 3 digits of the phone number
last_3_digits = substring(col("phone"), -3, 3)

# Mask all other characters with 'X'; repeat() is called through expr() so the
# repeat count can be computed per row from the phone length
masked_phone = concat(expr("repeat('X', length(phone) - 3)"), last_3_digits)

# Create a new DataFrame with the masked phone number
masked_df = df.withColumn("masked_phone", masked_phone)

# Show the result
masked_df.select("emp_id", "name", "masked_phone").show(truncate=False)
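An alternative sketch that produces the same masking with a single regular expression (assuming the same df and phone column; Spark's regexp_replace uses Java regex, which supports lookahead):

python

from pyspark.sql.functions import regexp_replace, col

# Replace every character that still has at least 3 characters after it with 'X'
masked_df_alt = df.withColumn("masked_phone", regexp_replace(col("phone"), ".(?=.{3})", "X"))
masked_df_alt.select("emp_id", "name", "masked_phone").show(truncate=False)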

=========================================================================

Table: OnlineCourses

Columns Expected: CourseID,TotalEnrollments,TotalCompletions,DropRate

+--------+--------------+---------+--------------+
|CourseID|EnrollmentDate|StudentID|CompletionDate|
+--------+--------------+---------+--------------+
|       1|    2021-01-01|        1|    2021-01-15|
|       1|    2021-01-01|        2|          null|
|       1|    2021-01-01|        3|    2021-01-10|
|       2|    2021-02-01|        1|    2021-02-10|
|       2|    2021-02-01|        2|          null|
|       2|    2021-02-01|        3|          null|
|       3|    2021-03-01|        1|    2021-03-10|
|       3|    2021-03-01|        2|    2021-03-15|
|       3|    2021-03-01|        3|          null|
+--------+--------------+---------+--------------+

Here’s a SQL query that will produce the desired output:

sql

SELECT
    CourseID,
    COUNT(*) AS TotalEnrollments,
    SUM(CASE WHEN CompletionDate IS NOT NULL THEN 1 ELSE 0 END) AS TotalCompletions,
    ROUND(
        (1.0 * SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) / COUNT(*)) * 100,
        2
    ) AS DropRate
FROM
    OnlineCourses
GROUP BY
    CourseID;

Explanation:

TotalEnrollments: COUNT(*) gives the total number of enrollments for each course.

TotalCompletions: SUM(CASE WHEN CompletionDate IS NOT NULL THEN 1 ELSE 0 END) counts the
number of students who completed the course.

DropRate: To calculate the drop rate:

SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) counts the number of students who did not complete the course.

COUNT(*) gives the total number of enrollments.

1.0 * ensures that the division is done in floating-point arithmetic to get a decimal result.

(1.0 * SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) / COUNT(*)) * 100 computes
the percentage of students who dropped out.

ROUND(..., 2) rounds the drop rate to two decimal places.

This query will give you the CourseID, total number of enrollments, total number of completions, and
the drop rate as a percentage for each course in the OnlineCourses table.
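For comparison, a PySpark DataFrame version of the same aggregation (a sketch, assuming the data has already been loaded into a DataFrame named online_courses):

python

from pyspark.sql.functions import col, count, when, round as spark_round, sum as spark_sum

result = (online_courses.groupBy("CourseID")
          .agg(count("*").alias("TotalEnrollments"),
               spark_sum(when(col("CompletionDate").isNotNull(), 1).otherwise(0)).alias("TotalCompletions"),
               spark_round(spark_sum(when(col("CompletionDate").isNull(), 1).otherwise(0)) / count("*") * 100, 2)
                   .alias("DropRate")))
result.show()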

Given Data

Here’s the OnlineCourses table again:

CourseID  EnrollmentDate  StudentID  CompletionDate
1         2021-01-01      1          2021-01-15
1         2021-01-01      2          NULL
1         2021-01-01      3          2021-01-10
2         2021-02-01      1          2021-02-10
2         2021-02-01      2          NULL
2         2021-02-01      3          NULL
3         2021-03-01      1          2021-03-10
3         2021-03-01      2          2021-03-15
3         2021-03-01      3          NULL

SELECT
    CourseID,
    COUNT(*) AS TotalEnrollments,
    SUM(CASE WHEN CompletionDate IS NOT NULL THEN 1 ELSE 0 END) AS TotalCompletions,
    ROUND(
        (1.0 * SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) / COUNT(*)) * 100,
        2
    ) AS DropRate
FROM
    OnlineCourses
GROUP BY
    CourseID;

Step-by-Step Calculation

For CourseID 1:

TotalEnrollments: 3 (three entries for CourseID 1)

TotalCompletions: 2 (students with CompletionDate not NULL: StudentID 1 and 3)

DropRate:

Number of drops = 1 (student with CompletionDate NULL: StudentID 2)

Drop rate = (1 / 3) * 100 = 33.33%

For CourseID 2:

TotalEnrollments: 3 (three entries for CourseID 2)

TotalCompletions: 1 (student with CompletionDate not NULL: StudentID 1)

DropRate:

Number of drops = 2 (students with CompletionDate NULL: StudentID 2 and 3)

Drop rate = (2 / 3) * 100 = 66.67%

For CourseID 3:

TotalEnrollments: 3 (three entries for CourseID 3)

TotalCompletions: 2 (students with CompletionDate not NULL: StudentID 1 and 2)

DropRate:

Number of drops = 1 (student with CompletionDate NULL: StudentID 3)

Drop rate = (1 / 3) * 100 = 33.33%

Result Set

After running the query, you’ll get:


CourseID  TotalEnrollments  TotalCompletions  DropRate
1         3                 2                 33.33
2         3                 1                 66.67
3         3                 2                 33.33

This result shows the total number of enrollments, completions, and the calculated drop rate for
each course.

=========================================================================

The maximum partition size is 128 MB and we have a 10 MB file, but there are still 2 partitions internally. Why?

In Apache Spark, the partitioning of data when reading from a file is influenced by several factors.
Even if you have a small file (e.g., 10 MB) and the default maximum partition size is 128 MB, the file
may still be divided into multiple partitions. Here are some reasons why this can happen:

Reasons for Multiple Partitions

File Size vs. Partitioning Strategy:

Spark's partitioning strategy is not solely based on file size. The default behavior when reading files
into a DataFrame or RDD involves creating partitions based on the file's block size and other factors,
such as the number of splits Spark makes.

Default Behavior and Splitting:

By default, Spark may create multiple partitions based on the underlying file system's block size and
the way it splits files. This can result in files being split into multiple partitions even if they are smaller
than the maximum partition size.

File Format and Splits:


The behavior can vary based on the file format. For example, text files might be split differently
compared to Parquet or ORC files. Text files can be split by line breaks or other delimiters, which
might result in multiple partitions even if the total file size is small.

Hadoop Configuration:

In the case of files read from Hadoop-compatible file systems (like HDFS), the Hadoop block size
configuration can influence how Spark splits the file. If the Hadoop block size is smaller than Spark's
partition size, it can affect the number of partitions created.

Spark Configuration Settings:

Certain Spark configuration settings control how input files are split. spark.sql.files.maxPartitionBytes sets the maximum partition size, and spark.sql.files.minPartitionNum sets a suggested minimum number of input partitions (by default tied to the cluster's default parallelism, i.e. the number of available cores). This is usually why a 10 MB file still ends up in 2 partitions: with 2 or more cores available, Spark aims for at least that many splits even though the file is well below maxPartitionBytes.

Example Configuration

Here’s how Spark configuration settings can impact partitioning:

spark.sql.files.maxPartitionBytes: Defines the maximum number of bytes to pack into a single partition when reading files. If the file is smaller than this threshold, Spark might still create multiple partitions based on other factors.

spark.sql.files.minPartitionNum: Defines the minimum number of partitions to use when reading files. If this setting is specified, Spark might create more partitions to meet this minimum requirement.
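A minimal sketch of adjusting these settings before reading (the path is hypothetical, and the resulting partition count still depends on the cluster's default parallelism; spark.sql.files.minPartitionNum is available in Spark 3.1+). The effect can then be verified with getNumPartitions() as shown in the next snippet:

python

# Illustrative values: pack up to 128 MB per partition and drop the minimum-partition floor
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
spark.conf.set("spark.sql.files.minPartitionNum", "1")

df = spark.read.format("csv").option("header", "true").load("/mnt/raw/small_file.csv")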

Example Code for Checking Partitions

To verify how many partitions a DataFrame has and to investigate further, you can use the following
code:

python

df = spark.read.format("csv").option("header", "true").load("path/to/your/file.csv")

# Check the number of partitions
num_partitions = df.rdd.getNumPartitions()
print(f"Number of partitions: {num_partitions}")

Summary
Even though the file size (10 MB) is smaller than the default maximum partition size (128 MB), Spark
may still create multiple partitions due to its internal partitioning strategy, file format, block size, and
configuration settings. Understanding these factors can help you fine-tune partitioning behavior in
your Spark jobs.

======================================================================

managed identity?

A managed identity is a feature provided by cloud platforms like Microsoft Azure that allows services
and applications to authenticate and interact with other services securely, without the need to
manage credentials directly. Managed identities simplify the authentication process by providing a
system-assigned or user-assigned identity that can be used to access resources and services.
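As an illustration, a Python sketch that uses the Azure SDK's DefaultAzureCredential, which automatically picks up a managed identity when the code runs on an Azure service (assumes the azure-identity and azure-storage-blob packages; the storage account URL and container name are hypothetical):

python

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# No connection string or account key: a token is obtained via the managed identity of the host service
credential = DefaultAzureCredential()
blob_service = BlobServiceClient(account_url="https://mystorageaccount.blob.core.windows.net",
                                 credential=credential)

# List blobs in a container to verify access
container = blob_service.get_container_client("rawcontainer")
for blob in container.list_blobs():
    print(blob.name)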

==============================================================================

1) df = spark.read.format("csv").load(path)
2) df = spark.read.format("csv").option("inferSchema", True).load(path)
3) df = spark.read.format("csv").schema(schema).load(path)

Which of these uses more processing time and compute power? How many jobs are triggered for each of the three lines above?

Processing Time and Compute Power:


Basic CSV Reading: Least amount of processing, as it does not involve schema inference.

CSV Reading with Schema Inference: More processing time and compute power due to the additional
schema inference step.

CSV Reading with Explicit Schema: Generally the most efficient, as it avoids the overhead of schema
inference.

Number of Jobs:

Basic CSV Reading: 1 job (Spark reads the first line of the file to work out the number of columns).

CSV Reading with Schema Inference: Typically 2 jobs (one to read the first line, one full pass over the data to infer the column types).

CSV Reading with Explicit Schema: Typically no job at load time, since Spark does not need to scan the file; the data is only read when an action is triggered. (Exact job counts can vary slightly between Spark versions.)

By specifying the schema explicitly, you can reduce the overall processing time and computational
overhead, especially for large datasets or when schema inference is costly.
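A sketch of the three variants with the explicit schema spelled out (column names follow the employee example used later in these notes; path is assumed to point to the CSV file):

python

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_age", IntegerType(), True),
    StructField("emp_location", StringType(), True),
])

df1 = spark.read.format("csv").load(path)                              # all columns read as strings
df2 = spark.read.format("csv").option("inferSchema", True).load(path)  # extra pass over the data to infer types
df3 = spark.read.format("csv").schema(schema).load(path)               # no scan needed; the given schema is applied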

===========================================================================

file_path = "/mnt/rawcontainer/empinfo/empdata.csv"

What happens if you mistakenly specify a schema where a column intended to be of type int is defined as string? The intended column types are:

emp_id --> int
emp_name --> string
emp_age --> int
emp_location --> string


If you mistakenly specify a schema where a column intended to be of type int is defined as string, it
can lead to various issues depending on how Spark processes the data. Here's a detailed explanation
of what can happen and how to resolve the issue:

What Happens

Schema Mismatch During Data Read: When you provide a schema with emp_id as string while your
data actually contains integer values for emp_id, Spark will interpret the emp_id column as a string.
This mismatch may not immediately raise an error but can lead to data inconsistency or issues in
downstream processing.

Potential Errors and Warnings:

Type Inconsistency: If you have data that does not conform to the expected schema, you might
encounter type conversion issues when performing operations that require numeric types (e.g.,
sorting, aggregations).

Conversion Issues: If you later try to convert this column to int or perform operations assuming it is
an integer, Spark might throw errors or produce incorrect results.

Data Processing Errors:

Type Casting Errors: When performing transformations or actions that require integer operations,
such as mathematical calculations or aggregations, Spark might produce runtime errors or
unexpected behavior if it encounters non-numeric strings in a column expected to be numeric.

Example Code and Error

Assuming you have the following schema and CSV data:

python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("emp_id", StringType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_age", IntegerType(), True),
    StructField("emp_location", StringType(), True)
])

df = spark.read.format("csv").schema(schema).load("path/to/csv")

If your data file looks like this:

emp_id,emp_name,emp_age,emp_location
1,John Doe,30,New York
2,Jane Smith,25,Los Angeles

And you have specified emp_id as StringType, no error will occur during the read operation, but
issues may arise later if you attempt to perform operations assuming emp_id is an integer.

How to Resolve the Issue

Correct the Schema: If you realize that the column should be an integer, modify the schema
definition to use IntegerType for the emp_id column.

python

schema_correct = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_age", IntegerType(), True),
    StructField("emp_location", StringType(), True)
])

df_correct = spark.read.format("csv").schema(schema_correct).load("path/to/csv")

Cast Column to Correct Type: If you've already loaded the DataFrame and need to correct the column
type, you can use the cast method to convert the column to the appropriate type.

python

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df_corrected = df.withColumn("emp_id", col("emp_id").cast(IntegerType()))

Ensure that all values in the emp_id column are valid integers; otherwise, casting will result in null
values for invalid entries.
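A quick sketch to spot the rows that would become null after the cast (assuming df is the DataFrame read with the string schema above):

python

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# Rows where emp_id has a value but that value cannot be cast to an integer
bad_rows = df.filter(col("emp_id").isNotNull() & col("emp_id").cast(IntegerType()).isNull())
bad_rows.show()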

Handle Conversion Errors: Before casting, you might want to handle or clean invalid values to avoid
conversion errors:

python

from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType

df_filtered = df.withColumn(
    "emp_id",
    when(col("emp_id").rlike("^[0-9]+$"), col("emp_id").cast(IntegerType()))
    .otherwise(None)  # or handle invalid entries differently
)

Summary

Immediate Effect: No error will be raised just by specifying a wrong type during schema definition,
but you may face issues with data processing or transformations later on.

Resolution: Update the schema to the correct type, cast columns if needed, and handle any
conversion issues to ensure data consistency and correctness.

=============================================================================
How do you read multiple CSV files in one folder when the schema changes between files?

Below is the answer without schema change:

python

# Read all CSV files from the folder into a DataFrame
# ("header": the first row of each CSV file contains column headers;
#  "inferSchema": automatically infers the schema from the data, useful if the schema is unknown)
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(path)

Detailed Explanation

format("csv"): Specifies that the file format is CSV.

option("header", "true"): Indicates that the first line of each CSV file contains column headers. Set
this to "false" if your files do not have headers.

option("inferSchema", "true"): Allows Spark to automatically infer the schema based on the data. If
you have a predefined schema and want to avoid schema inference, you can skip this option or set it
to "false".

load(path): Loads all CSV files from the specified path. Spark will read all files in the folder and
concatenate them into a single DataFrame.

WITH SCHEMA CHANGE

When dealing with multiple CSV files that have different schemas (i.e., some files have more columns
than others), you need to handle schema variations gracefully to ensure you can read all files into a
single DataFrame. Here’s how you can approach this:
Strategies for Handling Schema Variations

Schema Merging: Use schema merging to handle different schemas. Spark can automatically merge
schemas when reading files, but it might not always work perfectly if there are significant differences
between schemas.

Read and Process Files Individually: Read each set of files with their respective schemas and then
combine them.

Define a Union Schema: Define a schema that encompasses all possible columns and use it to read
the files. This schema should include all columns from all files, with some columns possibly
containing null values if they don't exist in all files.

Approach 1: Schema Merging

Spark can merge schemas automatically for some formats, but note that the mergeSchema option is honored by the Parquet and ORC readers, not by the CSV data source, so for CSV files with differing columns it is effectively ignored and Approaches 2 and 3 are the more reliable choices. Even where schema merging is supported, it only works well if the shared columns have consistent names and types; if the column names or types differ significantly, it might not produce the expected results.

python

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("ReadCSVWithSchemaMerging") \
    .getOrCreate()

# Read all CSV files from the folder with schema merging
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mergeSchema", "true") \
    .load("path/to/csv/folder")

df.printSchema()
Approach 2: Read and Process Files Individually

If schema merging doesn’t work as expected, you can read files in groups according to their schema
and then merge them.

python

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create a Spark session
spark = SparkSession.builder \
    .appName("ReadAndMergeCSV") \
    .getOrCreate()

# Define schemas for different sets of files
schema_set1 = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_age", IntegerType(), True),
    StructField("emp_location", StringType(), True),
    # ... more columns if necessary
])

schema_set2 = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("emp_name", StringType(), True),
    # Only the columns that are common in this set of files
])

# Read files with different schemas
df_set1 = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema_set1) \
    .load("path/to/csv/folder/set1/*.csv")

df_set2 = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema_set2) \
    .load("path/to/csv/folder/set2/*.csv")

# Union the DataFrames
df_combined = df_set1.unionByName(df_set2, allowMissingColumns=True)

df_combined.printSchema()

Approach 3: Define a Union Schema

Define a comprehensive schema that includes all columns from all files, then read the files using this
schema. Columns not present in some files will be filled with null values.

python

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create a Spark session
spark = SparkSession.builder \
    .appName("ReadCSVWithUnionSchema") \
    .getOrCreate()

# Define a comprehensive schema including all possible columns
union_schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("emp_name", StringType(), True),
    StructField("emp_age", IntegerType(), True),
    StructField("emp_location", StringType(), True),
    StructField("additional_column1", StringType(), True),  # Include all possible columns
    StructField("additional_column2", StringType(), True)
    # ... add more fields if necessary
])

# Read all CSV files with the comprehensive schema
df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(union_schema) \
    .load("path/to/csv/folder")

df.printSchema()

Summary

Schema Merging: The mergeSchema option provides automatic schema merging for Parquet and ORC sources, but it is not applied by the CSV reader, so it rarely helps for CSV inputs.

Read and Process Individually: Handle files with different schemas separately and then union them.

Union Schema: Define a schema that includes all possible columns from all files and use it to read the
files.

How do you read an Excel file with 10 sheets into one DataFrame?

Here’s how you can read an Excel file with multiple sheets into a single DataFrame in Spark:

Using com.crealytics.spark.excel Library


Add the Spark Excel Library

You need to include the spark-excel library in your Spark session. This can be done by including the
dependency when starting your Spark session.

python

spark = SparkSession.builder \
    .appName("ReadExcelWithSheets") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") \
    .getOrCreate()

Replace spark-excel_2.12:0.14.0 with the version that matches your Spark and Scala versions. Check
spark-excel documentation for the correct version.

Read Excel Sheets

Since the library supports reading individual sheets, you can read each sheet into a separate
DataFrame and then union them into a single DataFrame. Here’s how you can do it:

python

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("ReadExcelWithSheets") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") \
    .getOrCreate()

# Path to the Excel file
excel_path = "path/to/excel/file.xlsx"

# List of sheet names
sheet_names = ["Sheet1", "Sheet2", "Sheet3", "Sheet4", "Sheet5",
               "Sheet6", "Sheet7", "Sheet8", "Sheet9", "Sheet10"]

# Read all sheets into separate DataFrames and union them
df_list = []
for sheet in sheet_names:
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("sheetName", sheet) \
        .option("header", "true") \
        .load(excel_path)
    df_list.append(df)

# Union all DataFrames into one
df_combined = df_list[0]
for df in df_list[1:]:
    df_combined = df_combined.unionByName(df, allowMissingColumns=True)

df_combined.printSchema()

Explanation:

option("sheetName", sheet): Specifies the sheet to read.

unionByName(df, allowMissingColumns=True): Combines DataFrames, allowing for missing columns.

Handling Varying Schemas

If the sheets have different schemas, ensure that:

Column names across sheets are consistent if you want to union them directly.

Use allowMissingColumns=True in unionByName to handle cases where columns might be missing in some DataFrames.

Alternative Approach: Using Pandas


If you prefer or need to use Pandas, you can read the Excel file with Pandas and then convert it to a
Spark DataFrame. This approach is useful if you're working in a non-distributed environment or for
initial data exploration.

python

import pandas as pd
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("ReadExcelWithSheets") \
    .getOrCreate()

# Path to the Excel file
excel_path = "path/to/excel/file.xlsx"

# Read all sheets into a dictionary of DataFrames
dfs = pd.read_excel(excel_path, sheet_name=None)

# Combine all sheets into a single DataFrame
combined_df = pd.concat(dfs.values(), ignore_index=True)

# Convert Pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(combined_df)

spark_df.printSchema()

Explanation:

pd.read_excel(excel_path, sheet_name=None): Reads all sheets into a dictionary of DataFrames.

pd.concat(dfs.values(), ignore_index=True): Combines all DataFrames.


spark.createDataFrame(combined_df): Converts the combined Pandas DataFrame to a Spark
DataFrame.

Summary

com.crealytics.spark.excel Library: Use it to read and union multiple sheets from an Excel file directly
in Spark.

Pandas: Read Excel with Pandas, combine sheets, and convert to a Spark DataFrame if needed.

Choose the method that best fits your environment and data processing requirements.

What can the output of this be? Give an example:

dfs = pd.read_excel(excel_path, sheet_name=None)

The command pd.read_excel(excel_path, sheet_name=None) in Pandas reads an Excel file and returns a dictionary where each key is a sheet name and each value is a DataFrame corresponding to that sheet. This allows you to work with all sheets in the Excel file individually.

Here’s a detailed breakdown and example of the output you can expect from this command:

Example Excel File

Assume you have an Excel file example.xlsx with two sheets:

Sheet1:

| ID | Name  | Age |
|----|-------|-----|
| 1  | Alice | 30  |
| 2  | Bob   | 25  |

Sheet2:

| ID | Product    | Price |
|----|------------|-------|
| 1  | Laptop     | 1000  |
| 2  | Smartphone | 500   |

Code to Read Excel File

python

import pandas as pd

# Path to the Excel file
excel_path = "path/to/example.xlsx"

# Read all sheets into a dictionary of DataFrames
dfs = pd.read_excel(excel_path, sheet_name=None)

# Print the keys of the dictionary to see sheet names
print(dfs.keys())

Output

When you run pd.read_excel(excel_path, sheet_name=None), dfs will be a dictionary with the
following structure:

python

{
    'Sheet1': pd.DataFrame({
        'ID': [1, 2],
        'Name': ['Alice', 'Bob'],
        'Age': [30, 25]
    }),
    'Sheet2': pd.DataFrame({
        'ID': [1, 2],
        'Product': ['Laptop', 'Smartphone'],
        'Price': [1000, 500]
    })
}

Explanation

Keys: The keys of the dictionary are the sheet names from the Excel file ('Sheet1', 'Sheet2' in this
case).

Values: The values are DataFrames corresponding to each sheet. Each DataFrame contains the data
from the respective sheet in the Excel file.

Example Output

Printing the Keys:

python

print(dfs.keys())

Output:

dict_keys(['Sheet1', 'Sheet2'])

This shows that there are two sheets in the Excel file, named 'Sheet1' and 'Sheet2'.

Accessing DataFrames:

python

sheet1_df = dfs['Sheet1']
sheet2_df = dfs['Sheet2']

print(sheet1_df)
print(sheet2_df)

Output:

# DataFrame for 'Sheet1'
   ID   Name  Age
0   1  Alice   30
1   2    Bob   25

# DataFrame for 'Sheet2'
   ID     Product  Price
0   1      Laptop   1000
1   2  Smartphone    500

Summary

The output of pd.read_excel(excel_path, sheet_name=None) is a dictionary where each key is a sheet name and each value is a DataFrame with the content of that sheet. This allows you to access and manipulate the data from multiple sheets easily within your Python code.
