KPMG Interview
"emp_id": "KGS123",
"emp_contact": "9876543210",
"emp_location": "Bangalore"}
emp_detail_dict_clone = emp_detail_dict
emp_detail_dict["emp_contact"] = "0123456789"
emp_detail_dict["emp_location"] = "Kochin"
print(emp_detail_dict_clone)
Assignment: emp_detail_dict_clone = emp_detail_dict does not create a copy; it binds
emp_detail_dict_clone to the same dictionary object as emp_detail_dict. Both variables now point
to the same dictionary in memory.
Modification: When you update emp_detail_dict, you are modifying the original dictionary object.
Since emp_detail_dict_clone references the same object, these changes are reflected when you
access emp_detail_dict_clone.
Output:
{'emp_id': 'KGS123', 'emp_contact': '0123456789', 'emp_location': 'Kochin'}
If instead a shallow copy is created:
import copy
# Creating a shallow copy
emp_detail_dict_clone = copy.copy(emp_detail_dict)
In this case, the changes to emp_detail_dict are not reflected in emp_detail_dict_clone, because
copy.copy() creates a new dictionary object (although it still shares references to any nested objects).
If emp_detail_dict_clone = copy.deepcopy(emp_detail_dict) is used instead, copy.deepcopy() creates a
completely independent copy of the dictionary, so changes to the original dictionary (including nested
objects) do not affect the deep copy.
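The difference between the two only becomes visible with nested data. A minimal sketch (the nested "emp_address" key is illustrative, not part of the original question):
import copy

emp = {"emp_id": "KGS123", "emp_address": {"city": "Bangalore"}}

shallow = copy.copy(emp)     # new outer dict, but the inner dict is shared
deep = copy.deepcopy(emp)    # fully independent copy, including nested objects

emp["emp_address"]["city"] = "Kochin"   # mutate the nested dict in place

print(shallow["emp_address"]["city"])   # Kochin -> shallow copy sees the change
print(deep["emp_address"]["city"])      # Bangalore -> deep copy is unaffected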
+------+------+-------------+
| 1| Gokul|+919098903987|
| 2|Deepak|+917898920928|
| 12| Sunil|+918987678923|
| 10| Pawan|+443290390924|
+------+------+-------------+
==================================================================================
Table: OnlineCourses
+--------+--------------+---------+--------------+
|CourseID|EnrollmentDate|StudentID|CompletionDate|
+--------+--------------+---------+--------------+
| 1| 2021-01-01| 1| 2021-01-15|
| 1| 2021-01-01| 2| null|
| 1| 2021-01-01| 3| 2021-01-10|
| 2| 2021-02-01| 1| 2021-02-10|
| 2| 2021-02-01| 2| null|
| 2| 2021-02-01| 3| null|
| 3| 2021-03-01| 1| 2021-03-10|
| 3| 2021-03-01| 2| 2021-03-15|
| 3| 2021-03-01| 3| null|
+--------+--------------+---------+--------------+
sql
SELECT
    CourseID,
    COUNT(*) AS TotalEnrollments,
    SUM(CASE WHEN CompletionDate IS NOT NULL THEN 1 ELSE 0 END) AS TotalCompletions,
    ROUND(
        (1.0 * SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) / COUNT(*)) * 100,
        2
    ) AS DropRate
FROM
    OnlineCourses
GROUP BY
    CourseID;
Explanation:
TotalEnrollments: COUNT(*) gives the total number of enrollments for each course.
TotalCompletions: SUM(CASE WHEN CompletionDate IS NOT NULL THEN 1 ELSE 0 END) counts the
number of students who completed the course.
DropRate: SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) counts the number of students
who did not complete the course, and COUNT(*) gives the total number of enrollments.
1.0 * ensures that the division is done in floating-point arithmetic to get a decimal result.
(1.0 * SUM(CASE WHEN CompletionDate IS NULL THEN 1 ELSE 0 END) / COUNT(*)) * 100 computes
the percentage of students who dropped out.
This query will give you the CourseID, total number of enrollments, total number of completions, and
the drop rate as a percentage for each course in the OnlineCourses table.
Given Data
CourseID  EnrollmentDate  StudentID  CompletionDate
1         2021-01-01      1          2021-01-15
1         2021-01-01      2          NULL
1         2021-01-01      3          2021-01-10
2         2021-02-01      1          2021-02-10
2         2021-02-01      2          NULL
2         2021-02-01      3          NULL
3         2021-03-01      1          2021-03-10
3         2021-03-01      2          2021-03-15
3         2021-03-01      3          NULL
Step-by-Step Calculation
For CourseID 1: 3 enrollments, 1 with a NULL CompletionDate.
DropRate: (1 / 3) * 100 = 33.33
For CourseID 2: 3 enrollments, 2 with a NULL CompletionDate.
DropRate: (2 / 3) * 100 = 66.67
For CourseID 3: 3 enrollments, 1 with a NULL CompletionDate.
DropRate: (1 / 3) * 100 = 33.33
Result Set
CourseID  TotalEnrollments  TotalCompletions  DropRate
1         3                 2                 33.33
2         3                 1                 66.67
3         3                 2                 33.33
This result shows the total number of enrollments, completions, and the calculated drop rate for
each course.
=========================================================================
In Apache Spark, the partitioning of data when reading from a file is influenced by several factors.
Even if you have a small file (e.g., 10 MB) and the default maximum partition size is 128 MB, the file
may still be divided into multiple partitions. Here are some reasons why this can happen:
Partitioning strategy: Spark's partitioning is not based solely on file size. When reading files into a
DataFrame, the effective split size is derived from spark.sql.files.maxPartitionBytes (default 128 MB),
spark.sql.files.openCostInBytes (default 4 MB), and the cluster's default parallelism: Spark spreads the
total input bytes across the available cores and may pick a split size well below 128 MB. As a result,
even a 10 MB file can end up in two or three partitions, and folders of small files are packed or split
according to these settings rather than the 128 MB ceiling alone.
Hadoop Configuration:
In the case of files read from Hadoop-compatible file systems (like HDFS), the Hadoop block size
configuration can influence how Spark splits the file. If the Hadoop block size is smaller than Spark's
partition size, it can affect the number of partitions created.
Example Configuration
To verify how many partitions a DataFrame has and to investigate further, you can use the following
code:
python
df = spark.read.format("csv").option("header", "true").load("path/to/your/file.csv")

# Number of partitions Spark actually created for this DataFrame
num_partitions = df.rdd.getNumPartitions()
print(num_partitions)
Summary
Even though the file size (10 MB) is smaller than the default maximum partition size (128 MB), Spark
may still create multiple partitions due to its internal partitioning strategy, file format, block size, and
configuration settings. Understanding these factors can help you fine-tune partitioning behavior in
your Spark jobs.
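As a sketch of how to influence this behaviour (the config values here are illustrative, not recommendations), you can adjust the file-split settings before reading, or repartition afterwards:
python
# Raise the max split size so a small file is not broken up further
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", str(16 * 1024 * 1024))     # 16 MB (illustrative)

df = spark.read.format("csv").option("header", "true").load("path/to/your/file.csv")
print(df.rdd.getNumPartitions())

# Or set the partition count explicitly after reading
df_one = df.coalesce(1)       # collapse to a single partition without a shuffle
df_eight = df.repartition(8)  # force eight partitions (incurs a shuffle)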
======================================================================
managed identity?
A managed identity is a feature provided by cloud platforms like Microsoft Azure that allows services
and applications to authenticate and interact with other services securely, without the need to
manage credentials directly. Managed identities simplify the authentication process by providing a
system-assigned or user-assigned identity that can be used to access resources and services.
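For example, a minimal Python sketch using the azure-identity and azure-storage-blob packages (the storage account URL is a placeholder):
python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# On an Azure-hosted resource (VM, Function, Databricks, etc.) with a managed identity
# assigned, DefaultAzureCredential picks up that identity - no stored keys or secrets.
credential = DefaultAzureCredential()

blob_service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",  # placeholder
    credential=credential,
)

for container in blob_service.list_containers():
    print(container.name)
DefaultAzureCredential resolves to the managed identity when the code runs on an Azure resource that has one assigned, so no secrets appear in code or configuration.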
==============================================================================
df = spark.read.format("csv").load(path)
2) df =
spark.read.format("csv").option("inferSchema",True).load(path)
3) df = spark.read.format("csv").schema(schema).load(path)
which will have used more processing time and compute power?
how many jobs for three lines above?
CSV Reading without a Schema (line 1): Cheaper than inference; Spark typically runs one small job just
to read the first line and determine the number of columns, and every column is treated as a string.
CSV Reading with Schema Inference (line 2): The most processing time and compute power, because
Spark has to scan the data an additional time to infer the column types.
CSV Reading with Explicit Schema (line 3): Generally the most efficient, as it avoids the overhead of
schema inference entirely.
Number of Jobs:
CSV Reading without a Schema: Typically 1 job at load time (to read the first row for the column layout).
CSV Reading with Schema Inference: Typically 2 jobs at load time (one to read the first row, one to scan
the data and infer types); the actual data processing happens later, when an action is called.
CSV Reading with Explicit Schema: Typically 0 jobs at load time; the read stays lazy until an action is
triggered.
By specifying the schema explicitly, you reduce the overall processing time and computational
overhead, especially for large datasets or when schema inference is costly.
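A short sketch of the three variants for comparison (the file path and column names are illustrative):
python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

path = "path/to/data.csv"  # illustrative

# 1) No schema, no inference: every column is read as a string (_c0, _c1, ...)
df1 = spark.read.format("csv").load(path)

# 2) Schema inference: Spark scans the data an extra time to work out the types
df2 = spark.read.format("csv").option("inferSchema", True).load(path)

# 3) Explicit schema: no inference scan; the read stays lazy until an action runs
schema = StructType([
    StructField("id", IntegerType(), True),    # illustrative columns
    StructField("name", StringType(), True),
])
df3 = spark.read.format("csv").schema(schema).load(path)
df3.printSchema()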
===========================================================================
file_path = "/mnt/rawcontainer/empinfo/empdata.csv"
what happens if emp_id is defined as StringType in the schema but the data in this file actually contains integers?
What Happens
Schema Mismatch During Data Read: When you provide a schema with emp_id as string while your
data actually contains integer values for emp_id, Spark will interpret the emp_id column as a string.
This mismatch may not immediately raise an error but can lead to data inconsistency or issues in
downstream processing.
Type Inconsistency: If you have data that does not conform to the expected schema, you might
encounter type conversion issues when performing operations that require numeric types (e.g.,
sorting, aggregations).
Conversion Issues: If you later try to convert this column to int or perform operations assuming it is
an integer, Spark might throw errors or produce incorrect results.
Type Casting Errors: When performing transformations or actions that require integer operations,
such as mathematical calculations or aggregations, Spark might produce runtime errors or
unexpected behavior if it encounters non-numeric strings in a column expected to be numeric.
python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("emp_id", StringType(), True),   # declared as string though the data holds integers
    StructField("emp_name", StringType(), True),
    StructField("emp_age", StringType(), True),
    StructField("emp_location", StringType(), True)
])
df = spark.read.format("csv").schema(schema).load("path/to/csv")
If the CSV file has the header:
emp_id,emp_name,emp_age,emp_location
And you have specified emp_id as StringType, no error will occur during the read operation, but
issues may arise later if you attempt to perform operations assuming emp_id is an integer.
Correct the Schema: If you realize that the column should be an integer, modify the schema
definition to use IntegerType for the emp_id column.
python
schema_correct = StructType([
    StructField("emp_id", IntegerType(), True),   # corrected to an integer type
    StructField("emp_name", StringType(), True),
    StructField("emp_age", StringType(), True),
    StructField("emp_location", StringType(), True)
])
df_correct = spark.read.format("csv").schema(schema_correct).load("path/to/csv")
Cast Column to Correct Type: If you've already loaded the DataFrame and need to correct the column
type, you can use the cast method to convert the column to the appropriate type.
python
from pyspark.sql.functions import col
df_casted = df.withColumn("emp_id", col("emp_id").cast(IntegerType()))
Ensure that all values in the emp_id column are valid integers; otherwise, casting will result in null
values for invalid entries.
Handle Conversion Errors: Before casting, you might want to handle or clean invalid values to avoid
conversion errors:
python
from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType

# Cast only values that look like integers; anything else becomes null
df_filtered = df.withColumn(
    "emp_id",
    when(col("emp_id").rlike("^[0-9]+$"), col("emp_id").cast(IntegerType()))
    .otherwise(None)
)
Summary
Immediate Effect: No error will be raised just by specifying a wrong type during schema definition,
but you may face issues with data processing or transformations later on.
Resolution: Update the schema to the correct type, cast columns if needed, and handle any
conversion issues to ensure data consistency and correctness.
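To make the sorting pitfall concrete, a small sketch (assuming the df read above, with emp_id still typed as a string):
python
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

# With emp_id stored as a string, ordering is lexicographic: "10" sorts before "2"
df.orderBy("emp_id").show()

# After casting to IntegerType, ordering is numeric: 2 sorts before 10
df.withColumn("emp_id", col("emp_id").cast(IntegerType())).orderBy("emp_id").show()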
=============================================================================
multiple csv in one folder with schema change?
df = spark.read.format("csv") \
.option("header", "true") \ # Indicates that the first row of each CSV file contains column headers
.option("inferSchema", "true") \ # Automatically infers schema from the data; can be used if
schema is unknown
.load(path)
Detailed Explanation
option("header", "true"): Indicates that the first line of each CSV file contains column headers. Set
this to "false" if your files do not have headers.
option("inferSchema", "true"): Allows Spark to automatically infer the schema based on the data. If
you have a predefined schema and want to avoid schema inference, you can skip this option or set it
to "false".
load(path): Loads all CSV files from the specified path. Spark will read all files in the folder and
concatenate them into a single DataFrame.
When dealing with multiple CSV files that have different schemas (i.e., some files have more columns
than others), you need to handle schema variations gracefully to ensure you can read all files into a
single DataFrame. Here’s how you can approach this:
Strategies for Handling Schema Variations
Schema Merging: Use schema merging to handle different schemas. Spark can automatically merge
schemas when reading files, but it might not always work perfectly if there are significant differences
between schemas.
Read and Process Files Individually: Read each set of files with their respective schemas and then
combine them.
Define a Union Schema: Define a schema that encompasses all possible columns and use it to read
the files. This schema should include all columns from all files, with some columns possibly
containing null values if they don't exist in all files.
Approach 1: Schema Merging
Spark can merge schemas automatically when reading self-describing formats such as Parquet and ORC
(via the mergeSchema option). The plain CSV source does not perform this kind of schema merging, so if
the column names or types differ across CSV files, this approach will generally not produce the expected
results; prefer Approach 2 or Approach 3 below.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadCSVWithSchemaMerging") \
    .getOrCreate()

# Read all CSV files from the folder with schema merging
# (mergeSchema is honoured by Parquet/ORC; the CSV source largely ignores it)
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("mergeSchema", "true") \
    .load("path/to/csv/folder")

df.printSchema()
Approach 2: Read and Process Files Individually
If schema merging doesn’t work as expected, you can read files in groups according to their schema
and then merge them.
python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder \
    .appName("ReadAndMergeCSV") \
    .getOrCreate()

# Illustrative column definitions; replace with the actual columns of each file group
schema_set1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])
schema_set2 = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True)   # extra column only present in set 2
])
df_set1 = spark.read.format("csv") \
.option("header", "true") \
.schema(schema_set1) \
.load("path/to/csv/folder/set1/*.csv")
df_set2 = spark.read.format("csv") \
.option("header", "true") \
.schema(schema_set2) \
.load("path/to/csv/folder/set2/*.csv")
df_combined.printSchema()
Approach 3: Define a Union Schema
Define a comprehensive schema that includes all columns from all files, then read the files using this
schema. Columns not present in some files will be filled with null values.
python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder \
    .appName("ReadCSVWithUnionSchema") \
    .getOrCreate()

# Superset of all columns seen across the files (illustrative names)
union_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True)
])
df = spark.read.format("csv") \
.option("header", "true") \
.schema(union_schema) \
.load("path/to/csv/folder")
df.printSchema()
Summary
Schema Merging: The mergeSchema option gives automatic schema merging for self-describing formats
such as Parquet and ORC; it is not a reliable solution for plain CSV.
Read and Process Individually: Handle files with different schemas separately and then union them.
Union Schema: Define a schema that includes all possible columns from all files and use it to read the
files.
Here’s how you can read an Excel file with multiple sheets into a single DataFrame in Spark:
First, include the spark-excel library as a dependency when starting your Spark session.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadExcelWithSheets") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") \
    .getOrCreate()
Replace spark-excel_2.12:0.14.0 with the version that matches your Spark and Scala versions. Check
spark-excel documentation for the correct version.
Since the library supports reading individual sheets, you can read each sheet into a separate
DataFrame and then union them into a single DataFrame. Here’s how you can do it:
python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadExcelWithSheets") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") \
    .getOrCreate()

excel_path = "path/to/excel/file.xlsx"

# List of sheet names (illustrative; use the actual sheet names in your workbook)
sheet_names = ["Sheet1", "Sheet2"]

df_list = []
for sheet in sheet_names:
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("sheetName", sheet) \
        .option("header", "true") \
        .load(excel_path)
    df_list.append(df)

# Union all per-sheet DataFrames into one
df_combined = df_list[0]
for df in df_list[1:]:
    df_combined = df_combined.unionByName(df)

df_combined.printSchema()
Explanation:
The loop reads each sheet into its own DataFrame and then unions them. Make sure column names
across sheets are consistent if you want to union them directly.
python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadExcelWithSheets") \
    .getOrCreate()

excel_path = "path/to/excel/file.xlsx"
# Read every sheet into a dict of {sheet_name: DataFrame}, then stack them
dfs = pd.read_excel(excel_path, sheet_name=None)
combined_df = pd.concat(dfs.values(), ignore_index=True)
# Convert the combined pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(combined_df)
spark_df.printSchema()
Explanation:
pd.read_excel with sheet_name=None returns a dictionary of DataFrames keyed by sheet name;
pd.concat stacks them into one pandas DataFrame, and spark.createDataFrame converts the result
into a Spark DataFrame.
Summary
com.crealytics.spark.excel Library: Use it to read and union multiple sheets from an Excel file directly
in Spark.
Pandas: Read Excel with Pandas, combine sheets, and convert to a Spark DataFrame if needed.
Choose the method that best fits your environment and data processing requirements.
=============================================================================
Here's a detailed breakdown of the output you can expect from pd.read_excel(excel_path, sheet_name=None), given a workbook with the following two sheets:
Sheet1:
| ID | Name | Age |
|----|-------|-----|
| 1 | Alice | 30 |
| 2 | Bob | 25 |
Sheet2:
| ID | Product | Price |
|----|-----------|-------|
| 1 | Laptop | 1000 |
| 2 | Smartphone| 500 |
python
import pandas as pd

excel_path = "path/to/example.xlsx"
# Read all sheets at once: returns a dict keyed by sheet name
dfs = pd.read_excel(excel_path, sheet_name=None)
print(dfs.keys())
Output
When you run pd.read_excel(excel_path, sheet_name=None), dfs will be a dictionary with the
following structure:
python
{
    'Sheet1': pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob'], 'Age': [30, 25]}),
    'Sheet2': pd.DataFrame({'ID': [1, 2], 'Product': ['Laptop', 'Smartphone'], 'Price': [1000, 500]})
}
Explanation
Keys: The keys of the dictionary are the sheet names from the Excel file ('Sheet1', 'Sheet2' in this
case).
Values: The values are DataFrames corresponding to each sheet. Each DataFrame contains the data
from the respective sheet in the Excel file.
Example Output
python
print(dfs.keys())
Output:
dict_keys(['Sheet1', 'Sheet2'])
This shows that there are two sheets in the Excel file, named 'Sheet1' and 'Sheet2'.
Accessing DataFrames:
python
sheet1_df = dfs['Sheet1']
sheet2_df = dfs['Sheet2']
print(sheet1_df)
print(sheet2_df)
Output:
ID Name Age
0 1 Alice 30
1 2 Bob 25
ID Product Price
0 1 Laptop 1000
1 2 Smartphone 500
Summary
pd.read_excel(excel_path, sheet_name=None) returns one pandas DataFrame per sheet, keyed by sheet
name; from there you can inspect, combine, or convert the individual sheets as needed.