Python, PySpark, SQL

The document contains a series of technical questions related to data processing, Spark architecture, and Databricks features. It covers topics such as transformations, data loading techniques, SQL queries, and optimization strategies. Additionally, it includes inquiries about recent projects, workflows, and data management practices.

Uploaded by Shobhit

1. What is the difference between narrow and wide transformations?
2. What is the Spark architecture?
3. What are initial, incremental, and delta loads?
4. What is the difference between an incremental load and a delta load?
5. In which layer do the initial and incremental loads take place?
6. Write code for an incremental load.
7. Explain the concepts of shuffling and stages in Spark.
8. Given two tables named "orders" and "order_details" with columns (order_id,
customer_id, order_date) and (order_id, product_id, quantity, unit_price), write an
SQL query to find the total revenue generated by each customer in the year 2023.
9. Table A: 1, 1; Table B: 1, 1, 1.
Write the output of a full join, inner join, left join, and right join.
10. Explain the concepts of the DAG and the lineage graph.
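
Questions 8 and 9 above can be tried out without a cluster. The following is a minimal sketch using Python's built-in sqlite3 module, not Spark or Databricks. Only the table and column names in the revenue query come from question 8; the sample rows, the ISO 'YYYY-MM-DD' text format for order_date, and the names `a`, `b`, and `id` for the two small tables in question 9 are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Question 8: hypothetical sample rows; order_date is assumed to be ISO text.
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_details (order_id INTEGER, product_id INTEGER,
                            quantity INTEGER, unit_price REAL);
INSERT INTO orders VALUES
    (1, 10, '2023-01-15'), (2, 10, '2023-06-01'),
    (3, 20, '2022-12-31'), (4, 20, '2023-03-10');
INSERT INTO order_details VALUES
    (1, 100, 2, 5.0), (1, 101, 1, 10.0), (2, 100, 3, 5.0),
    (3, 102, 4, 2.5), (4, 102, 2, 2.5);
""")

REVENUE_2023 = """
SELECT o.customer_id,
       SUM(d.quantity * d.unit_price) AS total_revenue
FROM orders AS o
JOIN order_details AS d ON d.order_id = o.order_id
WHERE o.order_date >= '2023-01-01' AND o.order_date < '2024-01-01'
GROUP BY o.customer_id
ORDER BY o.customer_id
"""
revenue = conn.execute(REVENUE_2023).fetchall()
print(revenue)

# Question 9: the first table holds 1, 1 and the second holds 1, 1, 1.
conn.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER);
INSERT INTO a VALUES (1), (1);
INSERT INTO b VALUES (1), (1), (1);
""")
inner_rows = conn.execute(
    "SELECT a.id, b.id FROM a JOIN b ON a.id = b.id").fetchall()
left_rows = conn.execute(
    "SELECT a.id, b.id FROM a LEFT JOIN b ON a.id = b.id").fetchall()
print(len(inner_rows), len(left_rows))
```

Because every row in both small tables carries the same key, each of the 2 rows on one side matches each of the 3 rows on the other, and no row is left unmatched. The right and full joins therefore return the same 6 rows as the inner and left joins shown here.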

1) What are workflows in Databricks?


2) What is Unity Catalog in Databricks, and what are its features?
3) What are the 4 Vs of Big Data?
4) If we have 1 driver and 3 workers and a dataset with 100 records, how many
partitions would be created for perfect distribution?
5) What are Spark optimization techniques?
6) Can you explain broadcast join with a real-world scenario?
7) How do you configure cluster settings?
8) Can you describe the domains of projects you've worked on?
9) What is pivoting in the context of data processing?
10) What is partitioning?
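
For question 4, one way to reason about "perfect distribution" is to look at the arithmetic of splitting 100 records. The sketch below is plain Python, not Spark. The helper name `partition_sizes` and the as-even-as-possible split are assumptions made for illustration; the number of partitions Spark actually creates depends on configuration and the data source, not on the record count alone.

```python
from typing import List


def partition_sizes(num_records: int, num_partitions: int) -> List[int]:
    """Sizes of partitions when records are spread as evenly as possible."""
    base, extra = divmod(num_records, num_partitions)
    # The first `extra` partitions each take one additional record.
    return [base + 1 if i < extra else base for i in range(num_partitions)]


three_way = partition_sizes(100, 3)  # one partition per worker
four_way = partition_sizes(100, 4)   # equal sizes, but 4 partitions on 3 workers

# Partition counts that give equal-sized partitions AND the same number of
# partitions on each of the 3 workers (an assumed reading of "perfect").
perfect = [p for p in range(1, 101) if 100 % p == 0 and p % 3 == 0]

print(three_way, four_way, perfect)
```

Under this reading, no partition count satisfies both conditions, because 100 is not divisible by 3. The closest three-way split is 34, 33, and 33 records.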

Recent project explanations


Questions related to recent projects:
What is Unity Catalog and how does it differ from Hive Metastore?
How do you maintain logging?
Steps for cost optimization.
Real problems related to concurrency control.
Scenario-based questions:
How would you fetch streaming data every 2 minutes from an API and ingest it into
Databricks? Write a step-by-step process.
Cross questions included:
Data Cleansing
Handling dirty/corrupt data
Reverse ETL
Medallion Architecture in detail
Cluster Configurations
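
The first steps of the API scenario above (call the API on a fixed interval, then land each raw batch as a file for a later load) can be sketched in plain Python. This is one possible approach under stated assumptions, not a Databricks-specific implementation. The function `poll_to_landing`, the `fetch` callable standing in for the real API call, and the file naming are all hypothetical. A separate job would still be needed to load the landed files into tables.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, List, Optional

POLL_INTERVAL_SECONDS = 120  # "every 2 minutes" from the question


def poll_to_landing(
    fetch: Callable[[], List[dict]],
    landing_dir: Path,
    interval_seconds: float = POLL_INTERVAL_SECONDS,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Call `fetch` repeatedly; write each non-empty batch as a JSON-lines file."""
    landing_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    polls = 0
    while max_polls is None or polls < max_polls:
        records = fetch()
        if records:
            # Counter-based names restart at zero on each run; a real job
            # would need names that stay unique across restarts.
            path = landing_dir / f"batch_{polls:06d}.json"
            with path.open("w", encoding="utf-8") as fh:
                for record in records:
                    fh.write(json.dumps(record) + "\n")
            written.append(path)
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return written


# Demo with a fake API and no real waiting.
_calls: List[int] = []


def fake_fetch() -> List[dict]:
    _calls.append(1)
    return [{"id": len(_calls)}]


with tempfile.TemporaryDirectory() as tmp:
    files = poll_to_landing(
        fake_fetch, Path(tmp) / "landing", max_polls=2, sleep=lambda _s: None
    )
    demo_records = [
        json.loads(line)
        for f in files
        for line in f.read_text(encoding="utf-8").splitlines()
    ]
print(demo_records)
```

Passing `fetch` and `sleep` in as parameters keeps the loop testable without a network connection or a two-minute wait. In a real pipeline `fetch` would wrap the HTTP call and `landing_dir` would point at storage that the downstream load can read.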
