KBKrishnaTeja Interview Questions
1) What is the difference between transformations and actions in PySpark?
Ans: Transformations are operations on RDDs or DataFrames that create a new RDD or
DataFrame, while actions return a value to the driver program. Transformations are lazy and
only build up the execution plan; actions trigger the actual computation and return results.
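As a quick illustration (a minimal sketch; the tiny DataFrame below is made up for the example),
filter() is a lazy transformation while count() is an action that triggers execution:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-vs-actions").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
filtered = df.filter(df.id > 1)  # transformation: only builds the plan, nothing runs yet
print(filtered.count())          # action: triggers the computation and returns 1 to the driver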
2) How do you write a PySpark snippet to read a CSV file and display its contents?
Ans: Create a SparkSession, read the file with spark.read.csv(), and display it with show():
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
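# Possible continuation of the snippet above; the file path and the header/inferSchema
# options are placeholders, not part of the original answer.
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
df.show()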
3) How do you optimize a slow-performing SQL query that involves multiple joins and
aggregates in a large dataset?
Ans: I would consider creating appropriate indexes, using query optimization techniques,
partitioning tables, and denormalizing data if necessary. I would also analyze query execution
plans and use profiling tools to identify bottlenecks.
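In Spark, for instance, you can inspect the physical plan before tuning; this is only an
illustrative sketch that assumes an existing SparkSession named spark and registered tables
named orders and customers:
orders = spark.table("orders")
customers = spark.table("customers")
joined = orders.join(customers, "customer_id").groupBy("region").count()
joined.explain()  # prints the execution plan so expensive steps (e.g. shuffles) can be spotted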
4) Explain the differences between INNER JOIN, LEFT JOIN, and RIGHT JOIN in SQL.
Ans: INNER JOIN returns only the rows that match in both tables. LEFT JOIN returns all rows
from the left table plus the matching rows from the right table (with NULLs where there is no
match), and RIGHT JOIN does the same for the right table.
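A quick PySpark illustration (assuming an existing SparkSession named spark; the two small
DataFrames are made up for the example):
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l_val"])
right = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r_val"])
left.join(right, "id", "inner").show()  # only id 2 (matches in both)
left.join(right, "id", "left").show()   # ids 1 and 2, r_val is NULL for id 1
left.join(right, "id", "right").show()  # ids 2 and 3, l_val is NULL for id 3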
5) How can you filter rows in a PySpark DataFrame based on a specific condition?
Ans: You can use the `filter()` method or SQL-like expressions with the `where()` method. For
example: filtered_df = df.filter(df.column_name > 100)
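where() accepts the same column expressions as filter() and also SQL-style strings;
column_name is a placeholder as in the answer above:
filtered_df = df.where(df.column_name > 100)   # column expression
filtered_df = df.where("column_name > 100")    # equivalent SQL-style string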
6) What are Parquet and Avro file formats, and how can you write a PySpark DataFrame to a
Parquet file?
Ans: Parquet is a columnar storage format and Avro is a row-based format; both are optimized
for analytics and big data workloads, support compression and schema evolution, and are
well-suited for data warehouse and ETL processes. A DataFrame is written to Parquet with
df.write.parquet(), as shown below.
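A minimal sketch of writing a DataFrame to Parquet and reading it back (the output path is a
placeholder):
df.write.mode("overwrite").parquet("output/table_parquet")
parquet_df = spark.read.parquet("output/table_parquet")
parquet_df.show()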
7) What is ETL in data warehousing? How do you simplify the process of ETL using AWS?
Ans: ETL (Extract, Transform, Load) is the process of extracting data from various sources,
transforming it into a consistent format, and loading it into a data warehouse for analysis.
We can use AWS Glue to manage the ETL process: Glue automates much of it by automatically
discovering, cataloging, and transforming data, which makes it easier to prepare data for
analytics and reporting.
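As a rough sketch of the extract-transform-load pattern itself (plain PySpark rather than an
actual Glue job; the S3 paths and column names are made up):
raw = spark.read.csv("s3://my-bucket/raw/sales.csv", header=True, inferSchema=True)  # extract
clean = raw.dropna().withColumnRenamed("amt", "amount")                              # transform
clean.write.mode("overwrite").parquet("s3://my-bucket/curated/sales/")                # load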
8) What is a star schema and how is it different from a snowflake schema in a data
warehouse?
Ans: A star schema has a central fact table with denormalized dimension tables, while a
snowflake schema normalizes the dimension tables into multiple linked tables.
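To make the difference concrete, the hypothetical queries below assume a fact_sales table; in
the star schema the date attributes sit in one denormalized dim_date table, while in the
snowflake version they are split across linked tables:
# Star schema: one join per dimension
spark.sql("""
    SELECT d.year, SUM(f.amount) AS total
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year
""").show()
# Snowflake schema: the dimension is itself normalized into further tables
spark.sql("""
    SELECT y.year, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_year y ON d.year_key = y.year_key
    GROUP BY y.year
""").show()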
9) What’s the role of a data catalog in a data warehouse architecture?
Ans: A data catalog helps users discover and understand the available data assets by providing
metadata and data lineage information.
10) What is the use of Amazon Kinesis, and how does it differ from AWS SQS?
Ans: Amazon Kinesis is used for real-time data streaming, while SQS is a message queuing
service for decoupling components in distributed systems.
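For illustration with boto3 (the stream name, queue URL, and payloads are placeholders):
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(StreamName="clickstream", Data=b'{"page": "/home"}', PartitionKey="user-1")

sqs = boto3.client("sqs")
sqs.send_message(QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",
                 MessageBody='{"task": "resize-image"}')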
11) How can you create a new column in a PySpark DataFrame by applying a user-defined
function to existing columns?
Ans: You can use the withColumn() method to add a new column by applying a user-defined
function to existing columns.
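A small sketch (the column names and the doubling logic are made up for the example):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

double_it = udf(lambda x: x * 2, IntegerType())            # user-defined function
df = df.withColumn("doubled_value", double_it(df.value))   # new column from an existing one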
12) How do you cache a PySpark DataFrame, and why might you want to do so?
Ans: To cache a PySpark DataFrame, I would use the cache() or persist() method. Caching is
useful when you have a DataFrame that will be reused in multiple operations to avoid
recomputation and improve performance.
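For example, assuming a DataFrame df that feeds several downstream queries:
df.cache()                             # marks the DataFrame for caching (lazy)
df.count()                             # first action materializes the cache
df.groupBy("some_col").count().show()  # later operations reuse the cached data
df.unpersist()                         # release the memory when finished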
13) Which AWS service can you use for data warehousing?
Ans: We can use Amazon Redshift, which is a fully managed data warehouse service in AWS.
It's designed for high-performance querying and analytics and is commonly used for data
warehousing and business intelligence.
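One common pattern is querying Redshift from Spark over JDBC; the connection details below are
placeholders, and a Redshift (or PostgreSQL) JDBC driver is assumed to be on the classpath:
redshift_df = (spark.read.format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.example.com:5439/dev")
    .option("dbtable", "public.sales")
    .option("user", "awsuser")
    .option("password", "secret")
    .load())
redshift_df.show()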
14) What is the difference between Amazon RDS and Amazon DynamoDB in AWS?
Ans: Amazon RDS is a managed relational database service, while Amazon DynamoDB is a
NoSQL database service. RDS is best for traditional SQL databases, while DynamoDB is a
scalable, highly available NoSQL solution.
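For illustration, DynamoDB is accessed through the AWS SDK rather than SQL (the table and
attribute names below are made up), while an RDS database is queried with an ordinary SQL driver:
import boto3

users = boto3.resource("dynamodb").Table("users")
users.put_item(Item={"user_id": "u1", "name": "Alice"})
print(users.get_item(Key={"user_id": "u1"})["Item"])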
15) Can you explain the purpose of the HAVING clause in SQL and can you provide an example
where you have used it?
Ans: The HAVING clause filters the groups produced by a GROUP BY clause based on an
aggregate condition. For example:
SELECT department, AVG(salary) FROM employees GROUP BY department HAVING
AVG(salary) > 50000;
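The PySpark equivalent applies a filter after the aggregation (assuming a DataFrame named
employees with the same columns as the SQL example):
from pyspark.sql import functions as F

(employees.groupBy("department")
    .agg(F.avg("salary").alias("avg_salary"))
    .filter(F.col("avg_salary") > 50000)
    .show())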
16) Explain the role of Hive and Hadoop in big data processing. How do they interact with each
other?
Ans: Hive is a data warehousing and SQL-like querying tool for Hadoop. It provides a
higher-level abstraction over Hadoop MapReduce, allowing users to query and analyze data
using SQL-like syntax. Hive queries are ultimately translated into MapReduce jobs for
execution on Hadoop.
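Spark can also query Hive tables directly through the Hive metastore; a minimal sketch,
assuming Hive is configured on the cluster and an employees table exists:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-example").enableHiveSupport().getOrCreate()
spark.sql("SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department").show()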