Interview Questions

The document contains interview questions for a Spark developer role. It includes questions about data migration, Spark architecture, working with DataFrames, joins, UDFs, and optimization techniques. It also provides code examples for reading datasets in Scala and PySpark.


Capgemini Spark Developer Real-time interview questions:

1. Explain your current project and your roles and responsibilities in it.

2. What considerations have to be taken into account during a data migration?

3. Explain the data migration and processing pipeline in your project.

4. Which ETL tools and technologies do you use, and why?

5. What is Spark? Explain the Spark architecture and relate it to your project.

6. Write Spark code to create a new column by applying operations on existing columns in a DataFrame.
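
A minimal PySpark sketch for question 6, assuming a DataFrame df with numeric columns salary and bonus (hypothetical names):

from pyspark.sql.functions import col

# withColumn adds a new column derived from existing ones
df2 = df.withColumn("total_comp", col("salary") + col("bonus"))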

7. How do you write out a DataFrame? Explain the different save modes.
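
For question 7, DataFrameWriter save modes are append, overwrite, ignore, and error/errorifexists (the default). A sketch with hypothetical output paths:

# overwrite replaces any existing data at the path
df.write.mode("overwrite").parquet("/tmp/out_parquet")
# append adds new files alongside existing data
df.write.mode("append").option("header", True).csv("/tmp/out_csv")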

8. How do you rename columns on a DataFrame?
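
For question 8, a sketch assuming a column named dept (hypothetical):

# Rename one column at a time
df2 = df.withColumnRenamed("dept", "department")
# Or supply a complete list of new names (assumes df has exactly three columns)
df3 = df.toDF("id", "name", "department")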

9. Explain the difference between coalesce and repartition. Have you used them in your project, and how?
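
For question 9: repartition triggers a full shuffle and can increase or decrease the partition count, while coalesce only merges existing partitions without a full shuffle, so it is the cheaper way to reduce them. A sketch:

df_more = df.repartition(200)  # full shuffle into 200 partitions
df_less = df.coalesce(10)      # narrow merge down to 10 partitions, no full shuffle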

10. What is data skew? How do you resolve it?
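
For question 10, one common fix for a skewed key is salting: spread the hot key across N artificial sub-keys before the wide operation, then combine the partial results. On Spark 3.x, adaptive query execution can also split skewed join partitions (spark.sql.adaptive.skewJoin.enabled). A rough sketch with a hypothetical salt factor of 10:

from pyspark.sql.functions import rand, floor

# Add a salt column so one hot key is spread over 10 sub-keys
salted = df.withColumn("salt", floor(rand() * 10))
# Aggregate or join on (key, salt) first, then reduce again without the salt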

11. Explain Spark optimization techniques.

12. What is a broadcast variable? Explain with an example.
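
For question 12: a broadcast variable ships a read-only value to each executor once, instead of resending it with every task. A sketch with a hypothetical country_code column:

codes = spark.sparkContext.broadcast({"IN": "India", "US": "United States"})

# Tasks read the shared value via .value
names = df.rdd.map(lambda row: codes.value.get(row["country_code"])).collect()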

13. What are the different types of joins in Apache Spark? Explain the classic join types.

14. What is a broadcast join? Explain the sort-merge join in Spark.
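
For questions 13 and 14: when one side of a join is small, a broadcast join copies it to every executor and avoids the shuffle-and-sort that a sort-merge join requires. A sketch, assuming a small dim_df and a large fact_df (hypothetical names):

from pyspark.sql.functions import broadcast

joined = fact_df.join(broadcast(dim_df), on="dept_id", how="inner")
joined.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin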

15. How do you remove duplicates from a DataFrame?
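
For question 15, a sketch (emp_id is a hypothetical key column):

df_unique = df.dropDuplicates()            # drop fully identical rows
df_by_key = df.dropDuplicates(["emp_id"])  # keep one row per key column value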

16. If we create a new column with the same name as one that already exists in the DataFrame, what happens?
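
For question 16: withColumn with an existing name does not fail; it silently replaces the old column. A sketch assuming an existing age column:

from pyspark.sql.functions import col

df2 = df.withColumn("age", col("age") + 1)  # overwrites the original age column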

17. Explain User Defined Functions (UDFs) in Spark. Have you used them in your project? If yes, explain how.
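
For question 17, a minimal UDF sketch (the name column is hypothetical); note that Python UDFs are opaque to the optimizer, so built-in functions are preferred when available:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def name_length(s):
    # UDFs must handle nulls themselves
    return len(s) if s is not None else None

df2 = df.withColumn("name_len", name_length(col("name")))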

18.What is the advantage of Lazy Evaluation in Spark?
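
For question 18: transformations only build up a logical plan; nothing runs until an action, which lets Spark optimize the whole chain as one job. A sketch:

from pyspark.sql.functions import col

adults = df.filter(col("age") > 18).select("name")  # nothing executes yet
adults.count()  # the action triggers a single optimized job for the whole chain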

19. What are the memory optimization techniques in Spark?
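
For question 19, one common lever is an explicit storage level, so cached partitions spill to disk instead of being recomputed. A sketch:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()      # an action materializes the cache
df.unpersist()  # release it when no longer needed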

20. Scenario-based question: given two DataFrames, emp and department, write code to join them.
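
A sketch for question 20, assuming both DataFrames share a dept_id column (hypothetical):

result = emp.join(department, on="dept_id", how="inner")
result.show()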

21. What is a SparkSession? How is it initialized?
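
For question 21: SparkSession is the unified entry point for DataFrames, SQL, and configuration; builder.getOrCreate() returns an existing session or creates one, as the PySpark snippet further below also shows. A sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()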

22. What issues have you faced in your project, and how did you resolve them?
===================================================================================
SCALA READING DATASET
======================
// Legacy spark-csv package syntax (Spark 1.x); on Spark 2.x+ this is spark.read.csv(...)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/usermanteshchougule3333gmail/Spark_Project_Marketing_Analytics/Input_Data/Marketing_Analysis.csv")

READING DATASET USING PYSPARK
=============================

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import IntegerType, DoubleType

# Create (or reuse) the SparkSession entry point
spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()

# Read a CSV with a header row, letting Spark infer column types
df = (spark.read
      .option("inferSchema", True)
      .option("header", True)
      .csv("/FileStore/tables/StudentData.csv"))
df.show()

============================================
INGESTION JOB FLOW
>> SecurityManager authenticates and disables the security check
>> Utils starts the Spark driver on its assigned port (e.g. port 26234)
>> SparkEnv registers MapOutputTracker and BlockManagerMaster
>> The BlockManagerMaster endpoint is defined, using org.apache.spark.storage.DefaultTopologyMapper to obtain topology information
>> The BlockManagerMaster endpoint is now up
>> DiskBlockManager creates a temporary directory (e.g. DiskBlockManager: Created local directory at /tmp/blockmgr-299f7dcd-03de-485a-ba60-ac80d8703f3c)
>> MemoryStore allocates memory for tasks (e.g. MemoryStore: MemoryStore started with capacity 7.8 GB)
========================================
CI/CD

Overview of a typical Databricks CI/CD pipeline


Develop and commit your code
Configure your agent
Design the pipeline
Define environment variables
Set up the pipeline
Get the latest changes
Develop unit tests (see the pytest sketch after this list)
Package library code
Generate and store a deployment artifact
Deploy artifacts
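
For the "Develop unit tests" step, a minimal pytest sketch for a DataFrame transformation; add_total_comp is a hypothetical function under test:

import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def add_total_comp(df):
    # hypothetical transformation under test
    return df.withColumn("total_comp", col("salary") + col("bonus"))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_total_comp(spark):
    df = spark.createDataFrame([(100, 10)], ["salary", "bonus"])
    assert add_total_comp(df).collect()[0]["total_comp"] == 110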
=============================
