My Walmart Interview Experience & Answers
Follow me Here:
LinkedIn: https://www.linkedin.com/in/ajay026/
https://lnkd.in/gU5NkCqi
My Walmart Data Engineer Interview Experience & Answers
5. Can you provide a PySpark code snippet that reads data from
a Delta Lake and performs a transformation?
Answer:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .getOrCreate()

# Read the Delta table into a DataFrame
df = spark.read.format("delta").load("/path/to/delta-table")

# Perform a transformation (example: a filter; the "status" column is an assumption)
transformed_df = df.filter(df["status"] == "active")
transformed_df.show()
5. Can you provide an advanced SQL query that retrieves the top
5 products by sales in each category?
Answer:
-- Column names (category, product_id, amount) are assumptions
SELECT category, product_id, total_sales
FROM (
    SELECT
        category,
        product_id,
        SUM(amount) AS total_sales,
        RANK() OVER (PARTITION BY category ORDER BY SUM(amount) DESC) AS sales_rank
    FROM sales_table
    GROUP BY category, product_id
) ranked_sales
WHERE sales_rank <= 5;
Question:
Answer:
1. Front-End Layer:
2. Application Layer:
3. Data Layer:
4. Caching Layer:
6. Security:
7. CI/CD Pipeline:
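The layers above are listed by name only; as one illustration, the caching layer typically sits between the application and data layers using a cache-aside pattern. A minimal Python sketch, using a plain dict in place of a real cache such as Redis (all names and data here are hypothetical):

```python
# Cache-aside pattern: check the cache first, fall back to the data layer on a miss.
cache = {}

def fetch_from_db(product_id):
    # Stand-in for a real database query (hypothetical data).
    return {"id": product_id, "name": f"product-{product_id}"}

def get_product(product_id):
    # 1. Try the caching layer.
    if product_id in cache:
        return cache[product_id]
    # 2. On a miss, read from the data layer and populate the cache.
    record = fetch_from_db(product_id)
    cache[product_id] = record
    return record
```

On repeated reads of the same key, only the first call touches the data layer; subsequent calls are served from the cache.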
Question:
Answer:
1. Data Ingestion:
o Azure IoT Hub: Serves as the central message hub for bi-directional communication between IoT devices and the cloud.
2. Stream Processing:
3. Data Storage:
5. Machine Learning:
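The stream-processing step above (Azure Stream Analytics or Spark Structured Streaming in a real pipeline) reduces to windowed aggregation over device events. A pure-Python sketch of a tumbling-window average, with simulated readings (all data hypothetical):

```python
from collections import defaultdict

# Simulated device readings: (device_id, epoch_seconds, temperature).
readings = [
    ("dev1", 0, 20.0),
    ("dev1", 30, 22.0),
    ("dev2", 70, 25.0),
    ("dev1", 130, 21.0),
]

def tumbling_window_avg(readings, window_seconds=60):
    """Average temperature per device per non-overlapping (tumbling) window."""
    sums = defaultdict(lambda: [0.0, 0])  # (device, window_start) -> [sum, count]
    for device, ts, temp in readings:
        window_start = (ts // window_seconds) * window_seconds
        acc = sums[(device, window_start)]
        acc[0] += temp
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

result = tumbling_window_avg(readings)
```

Here dev1's two readings in the first minute are averaged together, while its later reading falls into a separate window.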
SQL Question:
Write a query that returns the top 3 products by total sales amount over the last 30 days.
Answer:
WITH RecentSales AS (
    SELECT
        ProductID,
        SUM(Amount) AS TotalSales
    FROM
        Sales  -- table name is an assumption
    WHERE
        SaleDate >= CURRENT_DATE - INTERVAL '30' DAY  -- last 30 days; date syntax varies by dialect
    GROUP BY
        ProductID
)
SELECT
    ProductID,
    TotalSales
FROM
    RecentSales
ORDER BY
    TotalSales DESC
LIMIT 3;  -- TOP 3 / FETCH FIRST 3 ROWS ONLY in other dialects
This query calculates the total sales amount for each product in the
last 30 days and retrieves the top 3 products with the highest sales.
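The same "total per product in the last 30 days, top 3" logic can be sanity-checked outside SQL. A plain-Python equivalent with hypothetical sample data:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sales rows: (ProductID, SaleDate, Amount).
now = datetime(2024, 1, 31)
sales = [
    (1, now - timedelta(days=5), 100.0),
    (2, now - timedelta(days=10), 250.0),
    (1, now - timedelta(days=40), 999.0),  # outside the 30-day window, excluded
    (3, now - timedelta(days=1), 50.0),
    (4, now - timedelta(days=2), 75.0),
]

def top_products(sales, as_of, days=30, top_n=3):
    """Sum amounts per product within the window, then take the top N by total."""
    cutoff = as_of - timedelta(days=days)
    totals = defaultdict(float)
    for product_id, sale_date, amount in sales:
        if sale_date >= cutoff:
            totals[product_id] += amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

top = top_products(sales, now)
```

The 40-day-old row is filtered out, mirroring the WHERE clause, and the sort-then-slice mirrors ORDER BY ... DESC with a row limit.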
Question:
Using PySpark, how would you detect and remove duplicate records
from a DataFrame based on a composite key consisting of columnA
and columnB, keeping only the latest record based on a timestamp
column timestampCol?
Answer:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName("DeduplicateDataFrame").getOrCreate()

# df is assumed to be an existing DataFrame with columnA, columnB, and timestampCol
window_spec = Window.partitionBy("columnA", "columnB").orderBy(df["timestampCol"].desc())
df_with_row_num = df.withColumn("row_num", row_number().over(window_spec))
deduplicated_df = df_with_row_num.filter(df_with_row_num["row_num"] == 1).drop("row_num")
deduplicated_df.show()
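The keep-latest-record-per-composite-key logic can be verified without a Spark session. A plain-Python sketch with made-up rows:

```python
rows = [
    {"columnA": "a", "columnB": "x", "timestampCol": 1, "value": 10},
    {"columnA": "a", "columnB": "x", "timestampCol": 3, "value": 30},
    {"columnA": "b", "columnB": "y", "timestampCol": 2, "value": 20},
]

def dedupe_latest(rows):
    """Keep only the latest row per (columnA, columnB) key, by timestampCol."""
    latest = {}
    for row in rows:
        key = (row["columnA"], row["columnB"])
        if key not in latest or row["timestampCol"] > latest[key]["timestampCol"]:
            latest[key] = row
    return list(latest.values())

result = dedupe_latest(rows)
```

The two ("a", "x") rows collapse to the one with the newest timestamp, matching what the window + row_number approach produces at scale.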
https://topmate.io/ajay_kadiyala