
MINISTRY OF EDUCATION AND TRAINING
NATIONAL ECONOMIC UNIVERSITY
Faculty of Economic Mathematics
DSEB Program

MIDTERM EXAM
Program: DSEB        Intake: 63
Date: 09/11/2024     Session: 1
Time limit: 30 minutes

1. What is the default behavior of the dropDuplicates() method in a DataFrame?

A) It drops all rows.

B) It keeps the first occurrence of each duplicate row.

C) It drops all duplicates without keeping any.

D) It only drops duplicates based on specified columns.
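
For illustration, a minimal PySpark sketch of dropDuplicates(), assuming an active SparkSession named spark:

    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
    df.dropDuplicates().show()          # removes duplicate rows across all columns
    df.dropDuplicates(["id"]).show()    # deduplicates on the "id" column only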

2. Which of the following methods can be used to persist data in Spark? (Select all that apply)

A) cache()

B) persist()

C) saveAsTextFile()

D) store()
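
A hedged sketch of the persistence APIs, assuming DataFrames df and df2 and an RDD rdd already exist; the output path is only an example:

    from pyspark import StorageLevel

    df.cache()                                     # equivalent to persist() with the default storage level
    df2.persist(StorageLevel.MEMORY_AND_DISK)      # explicit storage level
    rdd.saveAsTextFile("/tmp/example_output")      # writes an RDD out to external storage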

3. How can you register a DataFrame as a temporary view in Spark SQL?

A) df.createOrReplaceTempView("view_name")

B) df.registerTempView("view_name")

C) df.createGlobalTempView("view_name")

D) df.createOrReplaceTemporaryView("view_name")
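
A minimal sketch of registering and querying a temporary view, assuming a DataFrame df with hypothetical name and age columns:

    df.createOrReplaceTempView("people")
    result = spark.sql("SELECT name, age FROM people WHERE age > 30")
    result.show()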

4. What does the map() transformation do in Spark?

A) It filters elements from an RDD.

B) It applies a function to each element and returns a new RDD.

C) It reduces the number of partitions.

D) It combines two RDDs.
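
A short map() sketch on an RDD, assuming the SparkContext is reached through spark.sparkContext:

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
    doubled = rdd.map(lambda x: x * 2)   # applies the function to every element
    print(doubled.collect())             # [2, 4, 6, 8]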

5. Which of the following describes "lazy evaluation" in Spark?

A) Operations are executed immediately upon being called.

B) Transformations are not computed until an action is called.

C) Data is stored on disk by default.

D) All computations happen in parallel.
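
A lazy-evaluation sketch: the map() below only records a plan; nothing executes until the action collect() is called.

    rdd = spark.sparkContext.parallelize(range(5))
    mapped = rdd.map(lambda x: x + 1)   # transformation: recorded, not executed
    result = mapped.collect()           # action: triggers the actual computation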

6. When using Spark SQL, what is the purpose of the explain() method?

A) To execute the query

B) To display the physical plan for the query execution

C) To optimize the query

D) To show the schema of the DataFrame
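
A brief explain() sketch, assuming a DataFrame df with a hypothetical age column:

    df.filter(df["age"] > 30).explain()      # prints the physical plan
    df.filter(df["age"] > 30).explain(True)  # extended: parsed, analyzed, optimized and physical plans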

7. What is a common use case for window functions in Spark SQL?

A) To group data by categories

B) To perform calculations across a set of rows related to the current row

C) To filter data based on conditions

D) To create temporary views
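
A window-function sketch, assuming a DataFrame df with hypothetical dept and salary columns:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
    df.withColumn("rank", F.row_number().over(w)).show()   # ranks rows within each department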


8. In Spark, what does the cache() method do?

A) It permanently stores the DataFrame.

B) It optimizes the query plan.

C) It stores the DataFrame in memory for faster access.

D) It drops the DataFrame from memory.

9. How can you group data in a DataFrame and perform an aggregation?

A) df.groupBy("column").agg(sum("value"))

B) df.aggregate("column", sum("value"))

C) df.group("column").sum("value")

D) df.groupBy("column").aggregate(sum("value"))
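
A grouped-aggregation sketch; the column names are placeholders taken from the options above:

    from pyspark.sql import functions as F

    df.groupBy("column").agg(F.sum("value").alias("total")).show()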

10. Which of the following can be used to handle missing values in a DataFrame? (Select all that apply)

A) fillna(value)

B) dropna()

C) replaceNulls(value)

D) ignoreNulls()
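
A sketch of the missing-value helpers, assuming a DataFrame df with a hypothetical city column:

    df.fillna(0)                  # replace nulls in numeric columns with 0
    df.fillna({"city": "N/A"})    # per-column replacement
    df.dropna()                   # drop rows containing any null
    df.dropna(subset=["city"])    # drop rows where a specific column is null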

11. What does the coalesce() method do when applied to a DataFrame?

A) It increases the number of partitions.

B) It reduces the number of partitions.

C) It merges multiple DataFrames.

D) It filters out null values.
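
A coalesce() sketch; the partition counts shown in comments are illustrative only:

    print(df.rdd.getNumPartitions())    # e.g. 8
    df_small = df.coalesce(2)           # reduces to 2 partitions without a full shuffle
    print(df_small.rdd.getNumPartitions())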

12. What type of join does Spark perform by default when joining two
DataFrames?
A) Inner join

B) Left join

C) Right join

D) Full outer join

13. What does the reduceByKey() operation do?

A) It combines values with the same key using a specified function.

B) It filters out keys based on a condition.

C) It sorts keys in ascending order.

D) It groups keys together without aggregation.
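
A reduceByKey() sketch on a small pair RDD:

    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
    totals = pairs.reduceByKey(lambda x, y: x + y)   # combines values per key
    print(totals.collect())                          # [('a', 4), ('b', 2)] (order may vary)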

14. Which of the following statements about Spark DataFrames is true? (Select all that apply)

A) They are immutable.

B) They can contain mixed data types.

C) They can only contain numeric data types.

D) They are optimized for query execution.

15. What is the purpose of the HAVING clause in Spark SQL?

A) To filter records before aggregation

B) To filter records after aggregation

C) To sort records

D) To group records

16. How do you perform an inner join between two DataFrames?

A) df1.join(df2, "key", "inner")

B) df1.innerJoin(df2, "key")
C) df1.join(df2, "key")

D) df1.joinInner(df2, "key")
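
A join sketch, assuming DataFrames df1 and df2 that share a key column:

    joined = df1.join(df2, "key")               # inner join is the default
    joined_explicit = df1.join(df2, "key", "inner")
    left = df1.join(df2, "key", "left")          # other join types are passed the same way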

17. Which method is used to rename a column in a DataFrame?

A) renameColumn()

B) withColumnRenamed("oldName", "newName")

C) changeColumnName()

D) setColumnName()
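
A renaming sketch, assuming a DataFrame df with a hypothetical oldName column:

    df = df.withColumnRenamed("oldName", "newName")   # returns a new DataFrame; DataFrames are immutable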

18. How can you optimize query performance in Spark SQL? (Select all that
apply)

A) Use partitioning on large tables

B) Avoid using too many joins

C) Always use non-optimized formats like CSV

D) Cache frequently accessed DataFrames

19. How do you perform an aggregation with grouping in Spark SQL?

A) SELECT column, SUM(value_column) FROM table GROUP BY column

B) SELECT SUM(value_column), GROUP BY column FROM table

C) SELECT column, COUNT(value_column) FROM table GROUP BY column

D) SELECT GROUP(column), SUM(value_column) FROM table
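
The same style of aggregation run against a temporary view; the view and column names (sales, region, amount) are assumed for illustration:

    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()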

20. Which of the following is the best practice for handling large datasets in
Spark?

A) Load all data into memory at once

B) Use partitioning to distribute data efficiently

C) Avoid using caching or persistence

D) Read data from disk only once

21. How should you handle skewed data in Spark?

A) Ignore the skew and proceed with processing

B) Use salting techniques to distribute data evenly

C) Increase the number of partitions

D) Use only one partition for processing
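
A simplified salting sketch; the salt range of 10 and the column names are assumptions, and the other side of a salted join would need matching replicated salts:

    from pyspark.sql import functions as F

    # add a random salt to the skewed key so hot keys spread across more partitions
    salted = df.withColumn(
        "salted_key",
        F.concat(F.col("key"), F.lit("_"), (F.rand() * 10).cast("int").cast("string")),
    )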

22. Which method allows you to change the data type of a column in a
DataFrame?

A) cast("newType")

B) changeType("newType")

C) convertType("newType")

D) modifyType("newType")
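
A cast() sketch, assuming df has a hypothetical age column currently stored as a string:

    from pyspark.sql import functions as F

    df = df.withColumn("age", F.col("age").cast("int"))
    df.printSchema()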

23. What is the best approach to monitor Spark applications?

A) Monitor logs only after job completion

B) Use the Spark UI and external monitoring tools

C) Ignore monitoring unless there are errors

D) Rely solely on system resource metrics

24. What does the explode() function do in Spark DataFrames?

A) It flattens nested structures into separate rows.

B) It combines multiple columns into one.

C) It filters out null values.

D) It aggregates data.
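
An explode() sketch on a hypothetical array column:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "items"])
    df.withColumn("item", F.explode("items")).show()   # one output row per array element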
25. How can you apply a user-defined function (UDF) to a column in a
DataFrame?

A) df.apply(udf, "column")

B) df.withColumn("new_column", udf(df["column"]))

C) df.transform(udf, "column")

D) df.udf("column")
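
A UDF sketch; the uppercase function and column names are illustrative only:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
    df = df.withColumn("new_column", to_upper(df["column"]))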

26. How can you optimize performance in a Spark application? (Select all
that apply)

A) Using partitioning effectively

B) Reducing the number of transformations

C) Increasing the number of partitions unnecessarily

D) Caching intermediate results

27. In which scenario would you use broadcast variables in Spark?

A) To send large amounts of data to all nodes efficiently

B) To store intermediate results.

C) To partition data across nodes.

D) To filter datasets.
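
A broadcast-variable sketch with a hypothetical lookup table:

    lookup = {"VN": "Vietnam", "US": "United States"}
    bc = spark.sparkContext.broadcast(lookup)           # shipped to every executor once

    rdd = spark.sparkContext.parallelize(["VN", "US"])
    print(rdd.map(lambda code: bc.value.get(code)).collect())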

28. Which of the following methods can be used to create a DataFrame from
an existing RDD?

A) createDataFrame()

B) toDF()

C) fromRDD()

D) loadDataFrame()
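
Two equivalent sketches for building a DataFrame from an existing RDD of tuples:

    rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
    df1 = spark.createDataFrame(rdd, ["name", "age"])
    df2 = rdd.toDF(["name", "age"])   # requires an active SparkSession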
29. Which of the following is NOT a feature of Apache Spark?

A) In-memory processing

B) Lazy evaluation

C) Real-time data processing

D) Strict consistency

30. What should you do to avoid memory issues when processing large
datasets?

A) Increase the driver memory limit

B) Use more shuffle partitions

C) Load all data into memory

D) Reduce the number of executors
