Interview Questions

The document contains interview questions for a Spark developer role. It includes questions about data migration, Spark architecture, working with DataFrames, joins, UDFs, and optimization techniques. It also provides code examples for reading datasets in Scala and PySpark.


Capgemini Spark Developer Real-time interview questions:

1. Explain your current project and your roles and responsibilities in it.

2. What considerations have to be taken into account during a data migration?

3. Explain the data migration and processing pipeline in your project.

4. Which ETL tools and technologies do you use, and why?

5. What is Spark? Explain the Spark architecture and relate it to your project.

6. Write Spark code to create a new column by applying operations on existing columns in a DataFrame.
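
A minimal PySpark sketch for question 6, assuming a DataFrame df with numeric columns salary and bonus (hypothetical names):

from pyspark.sql.functions import col

# withColumn adds a new column derived from existing ones
df2 = df.withColumn("total_comp", col("salary") + col("bonus"))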

7. How do you write out a DataFrame? Explain the different save modes.
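
For question 7, DataFrameWriter save modes are append, overwrite, ignore, and error/errorifexists (the default). A sketch with hypothetical output paths:

# overwrite replaces any existing data at the path
df.write.mode("overwrite").parquet("/tmp/out_parquet")
# append adds new files alongside existing data
df.write.mode("append").option("header", True).csv("/tmp/out_csv")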

8. How do you rename columns on a DataFrame?
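
For question 8, a sketch assuming a column named dept (hypothetical):

# Rename one column at a time
df2 = df.withColumnRenamed("dept", "department")
# Or supply a complete list of new names (assumes df has exactly three columns)
df3 = df.toDF("id", "name", "department")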

9. Explain the difference between coalesce and repartition. Have you used them in your project, and how?
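
For question 9: repartition triggers a full shuffle and can increase or decrease the partition count, while coalesce only merges existing partitions without a full shuffle, so it is the cheaper way to reduce them. A sketch:

df_more = df.repartition(200)  # full shuffle into 200 partitions
df_less = df.coalesce(10)      # narrow merge down to 10 partitions, no full shuffle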

10. What is data skew? How do you resolve it?
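
For question 10, one common fix for a skewed key is salting: spread the hot key across N artificial sub-keys before the wide operation, then combine the partial results. On Spark 3.x, adaptive query execution can also split skewed join partitions (spark.sql.adaptive.skewJoin.enabled). A rough sketch with a hypothetical salt factor of 10:

from pyspark.sql.functions import rand, floor

# Add a salt column so one hot key is spread over 10 sub-keys
salted = df.withColumn("salt", floor(rand() * 10))
# Aggregate or join on (key, salt) first, then reduce again without the salt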

11. Explain Spark optimization techniques.

12. What is a broadcast variable? Explain with an example.
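
For question 12: a broadcast variable ships a read-only value to each executor once, instead of resending it with every task. A sketch with a hypothetical country_code column:

codes = spark.sparkContext.broadcast({"IN": "India", "US": "United States"})

# Tasks read the shared value via .value
names = df.rdd.map(lambda row: codes.value.get(row["country_code"])).collect()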

13. What are the different types of joins in Apache Spark? Explain the classic join types.

14. What is a broadcast join? Explain the sort-merge join in Spark.
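
For questions 13 and 14: when one side of a join is small, a broadcast join copies it to every executor and avoids the shuffle-and-sort that a sort-merge join requires. A sketch, assuming a small dim_df and a large fact_df (hypothetical names):

from pyspark.sql.functions import broadcast

joined = fact_df.join(broadcast(dim_df), on="dept_id", how="inner")
joined.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin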

15. How do you remove duplicates from a DataFrame?
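
For question 15, a sketch (emp_id is a hypothetical key column):

df_unique = df.dropDuplicates()            # drop fully identical rows
df_by_key = df.dropDuplicates(["emp_id"])  # keep one row per key column value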

16. If we create a new column with the same name as one that already exists in the DataFrame, what happens?
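
For question 16: withColumn with an existing name does not fail; it silently replaces the old column. A sketch assuming an existing age column:

from pyspark.sql.functions import col

df2 = df.withColumn("age", col("age") + 1)  # overwrites the original age column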

17. Explain User Defined Functions (UDFs) in Spark. Have you used them in your project? If yes, explain how.
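
For question 17, a minimal UDF sketch (the name column is hypothetical); note that Python UDFs are opaque to the optimizer, so built-in functions are preferred when available:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def name_length(s):
    # UDFs must handle nulls themselves
    return len(s) if s is not None else None

df2 = df.withColumn("name_len", name_length(col("name")))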

18.What is the advantage of Lazy Evaluation in Spark?
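
For question 18: transformations only build up a logical plan; nothing runs until an action, which lets Spark optimize the whole chain as one job. A sketch:

from pyspark.sql.functions import col

adults = df.filter(col("age") > 18).select("name")  # nothing executes yet
adults.count()  # the action triggers a single optimized job for the whole chain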

19. What are the memory optimization techniques in Spark?
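
For question 19, one common lever is an explicit storage level, so cached partitions spill to disk instead of being recomputed. A sketch:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # an action materializes the cache
df.unpersist()  # release it when no longer needed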

20. Scenario-based question: given two DataFrames, emp and department, write code to join them.
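
A sketch for question 20, assuming both DataFrames share a dept_id column (hypothetical):

result = emp.join(department, on="dept_id", how="inner")
result.show()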

21. What is a SparkSession? How is it initialized?
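
For question 21: SparkSession is the unified entry point for DataFrames, SQL, and configuration; builder.getOrCreate() returns an existing session or creates one, as the PySpark snippet further below also shows. A sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()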

22. What issues have you faced in your project, and how did you resolve them?
===================================================================================
SCALA READING DATASET
======================
// Legacy spark-csv package syntax (Spark 1.x); on Spark 2.x+ this is spark.read.csv(...)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/usermanteshchougule3333gmail/Spark_Project_Marketing_Analytics/Input_Data/Marketing_Analysis.csv")

READING DATASET USING PYSPARK
=============================

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import IntegerType, DoubleType

# Create (or reuse) the SparkSession entry point
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()

# Read a CSV with a header row, letting Spark infer column types
df = (spark.read
      .option("inferSchema", True)
      .option("header", True)
      .csv("/FileStore/tables/StudentData.csv"))
df.show()

============================================
INGESTION JOB FLOW
>> SecurityManager authenticates and disables the security check
>> Utils starts the Spark driver on its assigned port (e.g. port 26234)
>> SparkEnv registers MapOutputTracker and BlockManagerMaster
>> The BlockManagerMaster endpoint is defined, using org.apache.spark.storage.DefaultTopologyMapper to obtain topology information
>> The BlockManagerMaster endpoint is now up
>> DiskBlockManager creates a temporary directory (e.g. DiskBlockManager: Created local directory at /tmp/blockmgr-299f7dcd-03de-485a-ba60-ac80d8703f3c)
>> MemoryStore allocates memory for tasks (e.g. MemoryStore: MemoryStore started with capacity 7.8 GB)
========================================
CI/CD

Overview of a typical Databricks CI/CD pipeline


Develop and commit your code
Configure your agent
Design the pipeline
Define environment variables
Set up the pipeline
Get the latest changes
Develop unit tests (see the pytest sketch after this list)
Package library code
Generate and store a deployment artifact
Deploy artifacts
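
For the "Develop unit tests" step, a minimal pytest sketch for a DataFrame transformation; add_total_comp is a hypothetical function under test:

import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def add_total_comp(df):
    # hypothetical transformation under test
    return df.withColumn("total_comp", col("salary") + col("bonus"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_total_comp(spark):
    df = spark.createDataFrame([(100, 10)], ["salary", "bonus"])
    assert add_total_comp(df).collect()[0]["total_comp"] == 110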
=============================
