PySpark and Python preparation notes

Useful for Python and PySpark interview preparation


PYTHON CODING

1. Function Definition: The function find_most_occuring(my_string) accepts a string
   and is supposed to calculate the frequency of each character.
2. Dictionary Usage: A dictionary dict_1 is initialized to store each character as a key, with its
   count as the value.
3. Character Counting: A loop iterates through my_string. For each character, it checks if the
   character already exists in dict_1. If it does, it increments the count by 1; otherwise, it
   initializes the character's count at 1.
4. Sort by Frequency: Once the counts are in dict_1, sort the dictionary items by values in
   descending order.
5. Output the Top 3: Extract and print the top three characters and their counts (a sketch of
   the full function follows below).
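
Putting steps 1-5 together, a minimal sketch of how the described function might look (find_most_occuring, my_string, and dict_1 are the names used above; everything else is illustrative):

def find_most_occuring(my_string):
    # Steps 2-3: count each character in a dictionary
    dict_1 = {}
    for ch in my_string:
        if ch in dict_1:
            dict_1[ch] += 1
        else:
            dict_1[ch] = 1
    # Step 4: sort the (character, count) pairs by count, descending
    sorted_items = sorted(dict_1.items(), key=lambda item: item[1], reverse=True)
    # Step 5: print the top three characters and their counts
    for ch, count in sorted_items[:3]:
        print(ch, count)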

INTERVIEW QUESTIONS on the ability to write efficient code, handle edge cases, and think
about improvements or alternative methods

1. How does your code handle uppercase and lowercase letters?
2. Can you explain why you chose a dictionary for counting?
3. What is the time complexity of your solution?
4. How would you handle ties in the top three characters if two characters have the same
frequency?
5. How could you modify this function to return the result as a list of tuples instead of
printing it?
6. How would you modify this code to handle very large strings?
7. Can you think of a built-in Python function or library that could simplify this task?
8. How would you test this function?
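
One possible answer to question 7, sketched below: collections.Counter can replace the manual counting loop, and its most_common method already returns items sorted by frequency.

from collections import Counter

def find_most_occuring(my_string):
    # Counter builds the character-frequency mapping in one pass;
    # most_common(3) returns the top three (character, count) tuples.
    return Counter(my_string).most_common(3)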

Finding the paint color with the lowest price from a dictionary
of paint colors and prices.
This task tests a candidate's knowledge of dictionary operations, function definitions, and the
use of the min function with a key argument in Python.
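
A minimal sketch of the task as described; the function name cheapest_paint and the sample prices are illustrative, and only the min(..., key=paints.get) idiom is taken from the questions below:

def cheapest_paint(paints):
    # paints maps color name -> price; min with key=paints.get
    # returns the key whose associated value is the smallest.
    return min(paints, key=paints.get)

# Example usage with made-up prices:
paints = {"red": 12.99, "blue": 9.49, "green": 10.75}
print(cheapest_paint(paints))  # -> blue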

1. Why did you use min with key=paints.get?
2. What would happen if two colors had the same minimum price?
3. What is the time complexity of this solution?
4. How would you modify the function to return the top three cheapest colors instead of
just one?
5. What are some edge cases to consider for this function?
6. Can you implement this using list comprehension instead of min?
7. What is the advantage of using a dictionary over a list in this scenario?

Spark-based data processing tasks


1. Create a list of tuples (rows): Each tuple contains a numeric value and a string
   identifier (e.g., (1, 'id1')).
2. Convert the list to an RDD: The candidate converts the list of rows into a Resilient
   Distributed Dataset (RDD) using spark.sparkContext.parallelize(rows).
3. Define a schema: The schema includes two fields:
   o value (IntegerType)
   o id (StringType)
4. Convert the RDD to a DataFrame: The candidate converts the RDD to a DataFrame
   using the defined schema and then displays it with df.show() (a code sketch follows this list).

This exercise demonstrates:
o Creating a DataFrame from raw data (such as a list of tuples).
o Structuring data with a schema definition in PySpark.
o Data transformation in a distributed environment using Spark.
o How to create and structure a DataFrame in PySpark.
o Working with RDDs and schemas in PySpark.
o Displaying or querying data in a DataFrame.
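
Based on steps 1-4 above, a sketch of what the candidate's code might look like (the sample rows and the app name are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("example").getOrCreate()

# Step 1: a list of tuples, each holding a numeric value and a string identifier
rows = [(1, 'id1'), (2, 'id2'), (3, 'id3')]

# Step 2: distribute the rows as an RDD
rdd = spark.sparkContext.parallelize(rows)

# Step 3: schema with a value (IntegerType) field and an id (StringType) field
schema = StructType([
    StructField("value", IntegerType(), True),
    StructField("id", StringType(), True),
])

# Step 4: build the DataFrame from the RDD and the schema, then display it
df = spark.createDataFrame(rdd, schema)
df.show()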

Coding interview questions on PySpark and DataFrames: PySpark's architecture, data
manipulation, and performance optimization techniques.

1. What is PySpark, and how does it differ from traditional Python data processing?
2. Can you explain the difference between an RDD and a DataFrame in PySpark?
3. Why do we need to define a schema in PySpark DataFrames?
4. How would you create a DataFrame from a list of tuples in PySpark?
5. What are some common use cases for Spark in a data engineering context?
6. How would you filter rows in this DataFrame where value is greater than 1?
7. What is lazy evaluation in Spark, and how does it apply to transformations in this
DataFrame?
8. Can you explain how Spark handles data across partitions, and why this is beneficial
for big data?
9. How would you perform a groupBy operation on this DataFrame, for example,
grouping by value?
10. What are some performance optimization techniques you could apply in PySpark?
11. How would you add a new column to the DataFrame with a transformed version of
value, for example, doubling each value?
12. If we need to save this DataFrame to a file or a database, how would we do that in
PySpark?
13. Can you explain what happens when you use df.show() versus df.collect()?
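
Questions 6, 9, and 11 all point at common DataFrame transformations. A hedged sketch of one way they could be answered against the df built earlier (the column names value and id are assumed from that schema):

from pyspark.sql import functions as F

# Q6: keep only rows where value is greater than 1 (a lazy transformation)
filtered = df.filter(df.value > 1)

# Q9: group by value and count the rows in each group
grouped = df.groupBy("value").count()

# Q11: add a new column holding value doubled
doubled = df.withColumn("value_doubled", F.col("value") * 2)

# Actions such as show() are what trigger the lazy transformations above
filtered.show()
grouped.show()
doubled.show()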
