KBKrishnaTeja Interview Questions
1) What is the difference between transformations and actions in PySpark?
Ans: Transformations are operations on RDDs or DataFrames that create a new RDD or
DataFrame, while actions return a value to the driver program. Transformations are lazy and
only build up the execution plan; actions trigger the actual computation and return results.
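As a quick illustration (a minimal sketch; the tiny DataFrame below is made up for the example),
filter() is a lazy transformation while count() is an action that triggers execution:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
filtered = df.filter(df.id > 1)  # transformation: only builds the plan, nothing runs yet
print(filtered.count())          # action: triggers the computation and returns 1 to the driver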
2) How do you write a PySpark snippet to read a CSV file and display its contents?
Ans: Create a SparkSession, read the file with spark.read.csv(), and display it with show():
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
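# Possible continuation of the snippet above; the file path and the header/inferSchema
# options are placeholders, not part of the original answer.
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
df.show()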
3) How do you optimize a slow-performing SQL query that involves multiple joins and
aggregates in a large dataset?
Ans: I would consider creating appropriate indexes, using query optimization techniques,
partitioning tables, and denormalizing data if necessary. I would also analyze query execution
plans and use profiling tools to identify bottlenecks.
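In Spark, for instance, you can inspect the physical plan before tuning; this is only an
illustrative sketch that assumes an existing SparkSession named spark and registered tables
named orders and customers:
orders = spark.table("orders")
customers = spark.table("customers")
joined = orders.join(customers, "customer_id").groupBy("region").count()
joined.explain()  # prints the execution plan so expensive steps (e.g. shuffles) can be spotted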
4) Explain the differences between INNER JOIN, LEFT JOIN, and RIGHT JOIN in SQL.
Ans: INNER JOIN returns only the rows that match in both tables. LEFT JOIN returns all rows
from the left table plus the matching rows from the right table (with NULLs where there is no
match), and RIGHT JOIN does the same for the right table.
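A quick PySpark illustration (assuming an existing SparkSession named spark; the two small
DataFrames are made up for the example):
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l_val"])
right = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r_val"])
left.join(right, "id", "inner").show()  # only id 2 (matches in both)
left.join(right, "id", "left").show()   # ids 1 and 2, r_val is NULL for id 1
left.join(right, "id", "right").show()  # ids 2 and 3, l_val is NULL for id 3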
5) How can you filter rows in a PySpark DataFrame based on a specific condition?
Ans: You can use the `filter()` method or SQL-like expressions with the `where()` method. For
example: filtered_df = df.filter(df.column_name > 100)
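where() accepts the same column expressions as filter() and also SQL-style strings;
column_name is a placeholder as in the answer above:
filtered_df = df.where(df.column_name > 100)   # column expression
filtered_df = df.where("column_name > 100")    # equivalent SQL-style string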
6) What are Parquet and Avro file formats, and how can you write a PySpark DataFrame to a
Parquet file?
Ans: Parquet is a columnar storage format and Avro is a row-based format; both are optimized
for analytics and big data workloads, support compression and schema evolution, and are
well-suited for data warehouse and ETL processes. A DataFrame is written to Parquet with
df.write.parquet(), as shown below.
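A minimal sketch of writing a DataFrame to Parquet and reading it back (the output path is a
placeholder):
df.write.mode("overwrite").parquet("output/table_parquet")
parquet_df = spark.read.parquet("output/table_parquet")
parquet_df.show()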
7) What is ETL in data warehousing? How do you simplify the process of ETL using AWS?
Ans: ETL (Extract, Transform, Load) is the process of extracting data from various sources,
transforming it into a consistent format, and loading it into a data warehouse for analysis.
We can use AWS Glue to manage the ETL process: Glue automates much of it by automatically
discovering, cataloging, and transforming data, which makes it easier to prepare data for
analytics and reporting.
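As a rough sketch of the extract-transform-load pattern itself (plain PySpark rather than an
actual Glue job; the S3 paths and column names are made up):
raw = spark.read.csv("s3://my-bucket/raw/sales.csv", header=True, inferSchema=True)  # extract
clean = raw.dropna().withColumnRenamed("amt", "amount")                              # transform
clean.write.mode("overwrite").parquet("s3://my-bucket/curated/sales/")                # load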
8) What is a star schema and how is it different from a snowflake schema in a data
warehouse?
Ans: A star schema has a central fact table with denormalized dimension tables, while a
snowflake schema normalizes the dimension tables into multiple linked tables.
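To make the difference concrete, the hypothetical queries below assume a fact_sales table; in
the star schema the date attributes sit in one denormalized dim_date table, while in the
snowflake version they are split across linked tables:
# Star schema: one join per dimension
spark.sql("""
    SELECT d.year, SUM(f.amount) AS total
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year
""").show()
# Snowflake schema: the dimension is itself normalized into further tables
spark.sql("""
    SELECT y.year, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_year y ON d.year_key = y.year_key
    GROUP BY y.year
""").show()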
9) What’s the role of a data catalog in a data warehouse architecture?
Ans: A data catalog helps users discover and understand the available data assets by providing
metadata and data lineage information.
10) What is the use of Amazon Kinesis, and how does it differ from AWS SQS?
Ans: Amazon Kinesis is used for real-time data streaming, while SQS is a message queuing
service for decoupling components in distributed systems.
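For illustration with boto3 (the stream name, queue URL, and payloads are placeholders):
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(StreamName="clickstream", Data=b'{"page": "/home"}', PartitionKey="user-1")

sqs = boto3.client("sqs")
sqs.send_message(QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
                 MessageBody='{"task": "resize-image"}')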
11) How can you create a new column in a PySpark DataFrame by applying a user-defined
function to existing columns?
Ans: You can use the withColumn() method to add a new column by applying a user-defined
function to existing columns.
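A small sketch (the column names and the doubling logic are made up for the example):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

double_it = udf(lambda x: x * 2, IntegerType())            # user-defined function
df = df.withColumn("doubled_value", double_it(df.value))   # new column from an existing one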
12) How do you cache a PySpark DataFrame, and why might you want to do so?
Ans: To cache a PySpark DataFrame, I would use the cache() or persist() method. Caching is
useful when you have a DataFrame that will be reused in multiple operations to avoid
recomputation and improve performance.
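For example, assuming a DataFrame df that feeds several downstream queries:
df.cache()                             # marks the DataFrame for caching (lazy)
df.count()                             # first action materializes the cache
df.groupBy("some_col").count().show()  # later operations reuse the cached data
df.unpersist()                         # release the memory when finished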
13) Which AWS service can you use for data warehousing?
Ans: We can use Amazon Redshift, which is a fully managed data warehouse service in AWS.
It's designed for high-performance querying and analytics and is commonly used for data
warehousing and business intelligence.
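One common pattern is querying Redshift from Spark over JDBC; the connection details below are
placeholders, and a Redshift (or PostgreSQL) JDBC driver is assumed to be on the classpath:
redshift_df = (spark.read.format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.example.com:5439/dev")
    .option("dbtable", "public.sales")
    .option("user", "awsuser")
    .option("password", "secret")
    .load())
redshift_df.show()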
14) What is the difference between Amazon RDS and Amazon DynamoDB in AWS?
Ans: Amazon RDS is a managed relational database service, while Amazon DynamoDB is a
NoSQL database service. RDS is best for traditional SQL databases, while DynamoDB is a
scalable, highly available NoSQL solution.
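For illustration, DynamoDB is accessed through the AWS SDK rather than SQL (the table and
attribute names below are made up), while an RDS database is queried with an ordinary SQL driver:
import boto3

users = boto3.resource("dynamodb").Table("users")
users.put_item(Item={"user_id": "u1", "name": "Alice"})
print(users.get_item(Key={"user_id": "u1"})["Item"])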
15) Can you explain the purpose of the HAVING clause in SQL and can you provide an example
where you have used it?
Ans: The HAVING clause filters the groups produced by a GROUP BY clause based on an
aggregate condition. For example:
SELECT department, AVG(salary) FROM employees GROUP BY department HAVING
AVG(salary) > 50000;
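The PySpark equivalent applies a filter after the aggregation (assuming a DataFrame named
employees with the same columns as the SQL example):
from pyspark.sql import functions as F

(employees.groupBy("department")
    .agg(F.avg("salary").alias("avg_salary"))
    .filter(F.col("avg_salary") > 50000)
    .show())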
16) Explain the role of Hive and Hadoop in big data processing. How do they interact with each
other?
Ans: Hive is a data warehousing and SQL-like querying tool for Hadoop. It provides a
higher-level abstraction over Hadoop MapReduce, allowing users to query and analyze data
using SQL-like syntax. Hive queries are ultimately translated into MapReduce jobs for
execution on Hadoop.
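Spark can also query Hive tables directly through the Hive metastore; a minimal sketch,
assuming Hive is configured on the cluster and an employees table exists:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-example").enableHiveSupport().getOrCreate()
spark.sql("SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department").show()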