PySpark and SQL
SELECT *
FROM employees
WHERE department_id IS NULL;
SELECT *
FROM (
    SELECT *, DENSE_RANK() OVER (ORDER BY SALARY DESC) AS salary_rank
    FROM EMPLOYEE
) AS ranked_salaries
WHERE salary_rank = N; -- Replace N with the desired rank
SELECT *
FROM employees
WHERE join_date < '2010-01-01' AND salary > 50000;
Q9. How do you broadcast variables in Spark, and when should you
use them?
In Spark, broadcast variables are used to efficiently share small read-only
data (like lookup tables or configuration settings) with all worker nodes,
without sending a copy for each task.
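A minimal sketch, assuming a SparkSession `spark` and a DataFrame `df` with a `country_code` column (the lookup dict is illustrative):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Small lookup table, shipped once per executor instead of once per task
country_names = spark.sparkContext.broadcast({"IN": "India", "US": "United States"})

@udf(StringType())
def to_country_name(code):
    # Executors only read the broadcast value; it is never modified
    return country_names.value.get(code, "Unknown")

df = df.withColumn("country_name", to_country_name(df["country_code"]))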
Q10. What are accumulators in PySpark, and how do they differ from
broadcast variables?
Feature          | Accumulators                            | Broadcast Variables
Purpose          | Aggregation (e.g., counters, sums)      | Share read-only data with executors
Mutable?         | Tasks can only add values               | Completely read-only
Access           | Only the driver can read the value      | All tasks can read it
Usage in Tasks   | Write-only in workers                   | Read-only in workers
Common Use Cases | Metrics, debugging, counting conditions | Lookup tables, configs, small datasets
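A minimal sketch showing the two side by side (assuming a SparkSession `spark` and a DataFrame `df` with an `id` column; the values are illustrative):
sc = spark.sparkContext

# Accumulator: tasks add to it, only the driver reads the final value
bad_rows = sc.accumulator(0)

# Broadcast variable: the driver publishes it, tasks only read it
valid_ids = sc.broadcast({1, 2, 3})

def check(row):
    if row["id"] not in valid_ids.value:
        bad_rows.add(1)

df.foreach(check)       # runs on the executors
print(bad_rows.value)   # the accumulated count, visible on the driver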
df.createOrReplaceTempView("temp_view")
spark.sql("SELECT * FROM temp_view WHERE age > 30").show()
from pyspark.sql.types import StringType

def to_uppercase(s):
    return s.upper()

# Register the Python function so it can be called from SQL
spark.udf.register("to_uppercase_sql", to_uppercase, StringType())

df.createOrReplaceTempView("people")
spark.sql("SELECT name, to_uppercase_sql(name) AS name_upper FROM people").show()
Q20. How do you read and write data in Parquet, CSV, and
JSON formats in PySpark?
Read:
df_parquet = spark.read.parquet("path/to/file.parquet")
df_csv = spark.read.option("header", "true").csv("path/to/file.csv")
df_json = spark.read.json("path/to/file.json")
Write:
df.write.mode("overwrite").parquet("path/to/output_parquet")
df.write.option("header", "true").mode("overwrite").csv("path/to/output_csv")
df.write.mode("overwrite").json("path/to/output_json")
1. Logical Optimization
2. Physical Plan Optimization
3. Rule-Based and Cost-Based Optimization
Uses rule-based techniques (static transformations) and optionally cost-based optimization (CBO) to make smarter choices.
4. Extensibility
Automatic optimization (you don’t need to tune manually)
Improved performance for complex SQL/DataFrame queries
Extensible for custom logic in enterprise environments
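To see these phases on a concrete query, you can print the plans Catalyst produces; a minimal sketch, assuming an employees Parquet dataset (the path and columns are illustrative):
df = spark.read.parquet("path/to/employees")

query = df.filter(df["salary"] > 50000).select("name", "salary")

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan (showing filter pushdown, column pruning, etc.)
query.explain(True)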
df = spark.read \
.option("inferSchema", "true") \
.json("path/to/file.json")
Manual Schema Definition (Recommended for Large Data)
You can define a StructType schema to explicitly specify data types and
improve performance.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Apply the explicit schema instead of inferring it
df = spark.read.schema(schema).json("path/to/file.json")
Q24. What are the different join types in Spark SQL, and when
would you use each?
Static partition pruning: Prunes partitions before query starts (e.g., WHERE
region = 'US').
DPP (Dynamic Partition pruning): Prunes partitions during execution,
based on values coming from another table.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled",
"true")
2. Broadcast Join
If one table is small, broadcast it to avoid shuffling and reduce skew impact.
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "join_key")
6. spark_partition_id()
In Apache Spark, spark_partition_id() is a built-in function that returns the
partition ID (an integer) of the row it is associated with. It is particularly
useful for debugging, understanding data distribution, and optimizing
performance.
from pyspark.sql.functions import spark_partition_id

df = spark.range(0, 10).repartition(3)
df.withColumn("partition_id", spark_partition_id()).show()
# 4. Optional aggregation
agg_df = sales_df.groupBy("product_id").sum("sales_amount")
# 5. Write to output sink (console or storage)
query = agg_df.writeStream \
.outputMode("complete") \
.format("console") \
.option("truncate", "false") \
.trigger(processingTime="10 seconds") \
.start()
query.awaitTermination()
Triggers:
.trigger(processingTime="10 seconds")   # every 10 seconds
.trigger(once=True)                     # one-time batch for debugging
Fault Tolerance
.writeStream.option("checkpointLocation", "/tmp/checkpoints/")
Q33. What are the best practices for partitioning data in large
datasets?
5. Optimize Shuffles
spark.conf.set("spark.sql.shuffle.partitions", 200) # default, increase or
decrease based on data size
spark.conf.set("spark.dynamicAllocation.enabled", "true")
schema_v2 = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("email", StringType(), True) # new column
])
df = spark.read.schema(schema_v2).json("path")
Data Checkpointing:-
Saves the RDD lineage and data to avoid recomputation.
Mainly used in DStream-based Spark Streaming (less common now).
query = df.writeStream.format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/").start()
Purpose             | Description
Fault Recovery      | Recovers the stream from failures by storing metadata and data state
Stateful Operations | Required for operations like updateStateByKey, mapGroupsWithState
Progress Tracking   | Tracks offsets (Kafka, file source), watermarks, and batch info
# "broker:9092" is a placeholder for the Kafka bootstrap server addresses
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "orders") \
    .load()
from pyspark.sql.functions import window

# Assumes the Kafka value has already been parsed into an event_time column
parsed = df \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "5 minutes")) \
    .count()
target.alias("t").merge(
source=new_data.alias("s"),
condition="t.id = s.id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
df = spark.read.parquet("s3://bucket/orders/")
incremental_df = df.filter(df["order_date"] > last_processed_date)
Add a salt key (e.g., key + rand()) to spread out skewed data; see the salting sketch below.
Use salting or enable AQE skew join handling (Spark 3.0+):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.sql("""
SELECT o.*, c.name, p.name
FROM orders o
JOIN customers c ON o.cust_id = c.id
JOIN products p ON o.prod_id = p.id
""")
8. Tune Configurations
Setting Purpose
spark.sql.shuffle.partitions Controls # of shuffle partitions
spark.sql.autoBroadcastJoinThreshold Max size (bytes) for broadcast
spark.sql.adaptive.enabled Enables Adaptive Query Execution
spark.sql.adaptive.skewJoin.enabled Handles skewed joins automatically
1. Data Encryption
At Rest
Enable encryption on storage systems like:
Amazon S3 (SSE-S3, SSE-KMS)
HDFS transparent encryption
Azure Data Lake encryption
Use columnar formats such as Parquet (with Snappy/GZIP compression); note that these codecs compress but do not encrypt, so rely on storage-level or Parquet modular encryption for the data itself.
In Transit
Enable TLS/SSL for Spark's internal communication (spark.ssl.* settings) and use HTTPS/TLS endpoints when reading from or writing to storage and external services.
from cryptography.fernet import Fernet
from pyspark.sql.functions import udf, col

# Generate (or load) a symmetric key and build the cipher on the driver
key = Fernet.generate_key()
cipher = Fernet(key)

@udf("string")
def encrypt(value):
    return cipher.encrypt(value.encode()).decode()

df = df.withColumn("ssn_encrypted", encrypt(col("ssn")))
Store encryption keys in a secure vault (e.g., AWS KMS, Azure Key Vault,
HashiCorp Vault).
4. Access Control
Use Role-Based Access Control (RBAC):
On data storage (S3, ADLS, Hive, etc.)
On Databricks / Spark clusters
Apply fine-grained access control via:
Apache Ranger (for HDFS, Hive, etc.)
Unity Catalog (Databricks)
5. Auditing and Logging
Log data access events, job submissions and runs, and configuration changes, and ship these logs to a central audit store.
7. DevSecOps Practices
Don't hardcode credentials in scripts.
Use secrets managers:
spark.conf.set("spark.hadoop.fs.s3a.access.key", ...) via environment vars or
secret scopes.
Encrypt logs and control log verbosity.
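A minimal sketch of pulling credentials from the environment rather than from code (on Databricks, dbutils.secrets.get(scope, key) plays the same role as os.environ here):
import os

# Credentials come from the environment / a secrets manager, never from source code
access_key = os.environ["AWS_ACCESS_KEY_ID"]
secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

spark.conf.set("spark.hadoop.fs.s3a.access.key", access_key)
spark.conf.set("spark.hadoop.fs.s3a.secret.key", secret_key)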
Security Measure Technique / Tool
Encryption (at rest) S3/KMS, HDFS encryption
2. Partitioning
Use repartition(n) to increase partitions (e.g., after a wide transformation).
Use coalesce(n) to reduce the number of partitions (e.g., before writing).
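A minimal sketch of both (the partition counts, column, and path are illustrative):
# Increase parallelism after a wide transformation (triggers a full shuffle)
df_repart = df.repartition(200, "customer_id")

# Reduce the number of output files before writing (avoids a full shuffle)
df_repart.coalesce(10).write.mode("overwrite").parquet("path/to/output")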
8. Optimize Joins
Ensure the join keys are distributed and avoid skewed joins.
Use salting or skew join hints when facing data skew.
df1.join(df2.hint("skew"), "key")
--executor-memory 4G
--executor-cores 4
--num-executors 10
3. Time Travel
Access previous versions of data using versioning or timestamps.
Useful for debugging, rollback, and reproducibility.
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path")
delta_table.history().show()  # Show all versions
spark.read.format("delta").option("versionAsOf", 3).load("/path")
1. Using versionAsOf
df = spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("/path/to/delta-table")
2. Using timestampAsOf
df = spark.read.format("delta") \
.option("timestampAsOf", "2024-05-20T10:00:00") \
.load("/path/to/delta-table")
4. Notes
Delta Lake stores all changes as incremental commits in the _delta_log/
directory.
By default the transaction log is retained for 30 days (delta.logRetentionDuration) and removed data files for 7 days (delta.deletedFileRetentionDuration); both are configurable table properties. Vacuuming with a shorter retention requires disabling the safety check first:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
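A minimal sketch of adjusting retention and cleaning up old files (path and intervals are illustrative):
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")

# Keep removed data files (and therefore time travel) for 30 days
spark.sql("""
    ALTER TABLE delta.`/path/to/delta-table`
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")

# Physically delete unreferenced files older than 168 hours (7 days)
delta_table.vacuum(168)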
from pyspark.sql import Window
from pyspark.sql.functions import sum

window_spec = Window.partitionBy("customer_id").orderBy("transaction_date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df = df.withColumn("running_total", sum("amount").over(window_spec))
2. Moving Average
from pyspark.sql.functions import avg
window_spec = Window.partitionBy("customer_id").orderBy("transaction_date") \
    .rowsBetween(-2, 0)  # 3-row moving average (current row plus the 2 preceding rows)

df = df.withColumn("moving_avg", avg("amount").over(window_spec))
window_spec = Window.partitionBy("category").orderBy("sales")
df = df.withColumn("row_num", row_number().over(window_spec)) \
.withColumn("rank", rank().over(window_spec)) \
.withColumn("dense_rank", dense_rank().over(window_spec))
window_spec = Window.partitionBy("user_id").orderBy("event_time")
window_spec = Window.partitionBy("user_id").orderBy("event_time")
df = df.withColumn("prev_status", lag("status").over(window_spec)) \
Follow Me | Subhash Yadav |Big Data Engineer
INTERVIEW QUESTIONS & WITH ANSWER
.withColumn("status_changed", when(col("status") != col("prev_status"),
1).otherwise(0))
window_spec = Window.partitionBy("department").orderBy("date")
df = df.withColumn("first_sale", first("sale").over(window_spec)) \
.withColumn("last_sale", last("sale").over(window_spec))
Function Description
`row_number()` Unique row number per partition
`rank()` Ranking with gaps
`dense_rank()` Ranking without gaps
`lag()` Value from a previous row
`lead()` Value from a following row
`sum()` Cumulative or windowed sum
`avg()` Moving or group average
`first()` First value in the window
`last()` Last value in the window
df.groupBy(
window("event_time", "10 minutes"),
"user_id"
).agg(sum("amount"))
4. Deduplication
df.dropDuplicates(["user_id", "event_time"])
5. Role of Watermarking
Watermarking helps limit state size by specifying the maximum expected
lateness of data.
df.withWatermark("event_time", "15 minutes")
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

error_count = spark.sparkContext.accumulator(0)

def parse_and_count(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        error_count.add(1)  # count rows that fail to parse
        return None

udf_parse = udf(parse_and_count, IntegerType())
df = df.withColumn("parsed", udf_parse(df["col"]))
Area Strategy
Driver code Try-except with logging
External systems Retry with exponential backoff
UDFs Safe exception handling inside logic
Streaming Use checkpointing and watermarking
Data quality Validate schema and critical fields early
Workflow orchestration Handle retries and notifications externally
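For the "External systems" row, a minimal retry-with-exponential-backoff sketch (the write call and path are illustrative):
import time

def with_retries(action, max_attempts=3, base_delay=2):
    # Retry action(); wait 2s, 4s, 8s ... between attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay ** attempt)

with_retries(lambda: df.write.mode("append").parquet("s3a://bucket/output/"))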
3. Databricks
Available as part of the "Spark UI" tab inside each job/run.
provider_class = spark.conf.get("spark.sql.streaming.stateStore.providerClass")
Key Features: