
Q1. Difference between cache and persist, repartition and coalesce.

Ans. Cache and persist both keep a DataFrame/RDD around for reuse; cache uses the
default storage level, while persist lets you choose one. Repartition changes the
number of partitions with a full shuffle; coalesce only reduces partitions and
avoids a full shuffle.

Cache: Stores the DataFrame at the default storage level for faster access during
subsequent operations.

Persist: Same mechanism, but you choose the storage level explicitly (memory only,
memory and disk, disk only, etc.), trading speed for capacity and fault tolerance.

Repartition: Increases or decreases the number of partitions, shuffling data across
nodes.

Coalesce: Reduces the number of partitions without a full shuffle, which makes it
more efficient for decreasing partitions.

Example of Cache: df.cache() to speed up iterative algorithms.

Example of Persist: df.persist(StorageLevel.MEMORY_AND_DISK) for fault tolerance.

Example of Repartition: df.repartition(10) to increase partitions for parallel
processing.

Example of Coalesce: df.coalesce(2) to reduce partitions after filtering.
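
A minimal PySpark sketch tying these calls together (a sketch, assuming an existing
SparkSession; the data and output path are illustrative, not from the original
answer):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache_persist_demo").getOrCreate()
df = spark.range(1_000_000)

# cache(): default storage level, useful when the same DataFrame is reused
df.cache()
df.count()                                   # an action materializes the cache

# persist(): explicit storage level, spills to disk when memory is tight
filtered = df.filter("id % 2 = 0")
filtered.persist(StorageLevel.MEMORY_AND_DISK)

# repartition(): full shuffle, e.g. to increase parallelism before a heavy join
repartitioned = filtered.repartition(10)

# coalesce(): narrow, no full shuffle, e.g. to write fewer output files
repartitioned.coalesce(2).write.mode("overwrite").parquet("/tmp/coalesced_output")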

Q2. What are the key optimization techniques in Apache Spark?

Ans. Apache Spark optimization techniques improve performance by enabling efficient
data processing and better resource management.

Key optimization strategies include:

Use DataFrames and Datasets instead of RDDs, as they benefit from Catalyst
optimizer and Tungsten execution engine.

Leverage lazy evaluation so that only the necessary transformations are executed
when an action is triggered.

Apply partitioning to evenly distribute data across nodes:

Use repartition() to increase the number of partitions (e.g., before large joins).

Use coalesce() to reduce partitions without full data shuffle (e.g., before writing
to disk).

Minimize shuffling:

Prefer narrow transformations (like map, filter) where each partition depends only
on a single partition from the parent RDD.

Be cautious with wide transformations (like groupByKey, reduceByKey, join) that
require data to be shuffled across the network.

Broadcast smaller datasets:

Use broadcast joins to avoid costly shuffles when joining a large dataset with a
small one (broadcast(df)); a sketch appears at the end of this answer.

Use caching and persistence:

Cache frequently accessed intermediate results with df.cache() or df.persist() to
avoid recomputation and speed up iterative operations.

Types of Transformations:

Narrow Transformations: map(), filter(), union()

Wide Transformations: groupByKey(), reduceByKey(), join()

Shuffling:

Shuffling is the process of redistributing data across partitions, which is an
expensive operation. Minimizing unnecessary shuffles is crucial for Spark
performance optimization.
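
A hedged sketch of a few of these strategies together (broadcast join, caching, and
coalescing before a write); the file paths and column names are assumptions, not
part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization_demo").getOrCreate()

# Illustrative DataFrames: a large fact table and a small dimension table
orders = spark.read.parquet("/data/orders")          # hypothetical path
countries = spark.read.parquet("/data/countries")    # hypothetical small lookup table

# Broadcast join: ships the small table to every executor, avoiding a shuffle
enriched = orders.join(broadcast(countries), on="country_id", how="left")

# Cache an intermediate result that is reused by several downstream actions
enriched.cache()
enriched.count()                                     # materializes the cache
daily = enriched.groupBy("order_date").count()

# Coalesce before writing to avoid producing many tiny output files
daily.coalesce(4).write.mode("overwrite").parquet("/data/daily_counts")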

Q3. What are the different types of tables in Hive, and what are the differences
between them?
Ans. Hive supports two types of tables: Managed and External, each with distinct
data management and storage characteristics.
Managed Tables: Hive manages both the schema and the data. Dropping the table
deletes the data.

External Tables: Hive manages only the schema. Dropping the table does not delete
the data, which remains in the external storage.

Use Managed Tables for temporary data that can be recreated easily.

Use External Tables for data that is shared with other applications or needs to
persist beyond Hive.

Example of Managed Table: CREATE TABLE managed_table (id INT, name STRING);

Example of External Table: CREATE EXTERNAL TABLE external_table (id INT, name
STRING) LOCATION '/path/to/data';

Comparison of RDD, DataFrame, and Dataset:

Level of abstraction: RDD - Low; DataFrame - High; Dataset - Middle
Type safety: RDD - Yes; DataFrame - No; Dataset - Yes
Performance optimization: RDD - No; DataFrame - Yes (Catalyst); Dataset - Yes (Catalyst)
API style: RDD - Functional; DataFrame - SQL + Functional; Dataset - SQL + Functional
Best use case: RDD - Unstructured or complex transformations; DataFrame - Structured
data (CSV, JSON); Dataset - Structured data with type safety (Scala/Java only)

Task: Connect Spark to SQL Server (Azure SQL)

url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=myuser@myserver;password=mypassword;encrypt=true;"

df = spark.read.format("jdbc").option("url", url).option("dbtable", "my_table").load()

properties = {
    "user": "myuser@myserver",
    "password": "mypassword",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=url, table="my_table", properties=properties)

Make sure the Azure SQL firewall allows the Spark cluster's IP address; otherwise
the connection will be blocked.

Azure Active Directory (AAD) authentication can be configured instead of a plain
password for more secure authentication.

df.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "target_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("batchsize", "1000") \
    .save()

Q2. Write code for printing duplicate numbers in a list.


Ans. This code identifies and prints duplicate numbers from a given list using a
dictionary to track occurrences.
Use a dictionary to count occurrences of each number.

Iterate through the list and update the count in the dictionary.

Print numbers that have a count greater than 1.

Example: For the list [1, 2, 3, 2, 4, 3], the output should be 2 and 3.

def print_duplicates(nums):
    count = {}
    for num in nums:
        if num in count:
            count[num] += 1
        else:
            count[num] = 1

    for num, freq in count.items():
        if freq > 1:
            print(num)

# Example usage
nums = [1, 2, 3, 2, 4, 3]
print_duplicates(nums)
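
As a design note, the same result can be obtained with collections.Counter from the
standard library, which keeps the counting logic in one pass; a short equivalent
sketch:

from collections import Counter

def print_duplicates_counter(nums):
    # Counter builds the {number: occurrences} mapping in one pass
    for num, freq in Counter(nums).items():
        if freq > 1:
            print(num)

print_duplicates_counter([1, 2, 3, 2, 4, 3])   # prints 2 and 3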

Q5. What are the roles of the driver, executors, and the cluster manager in a Spark
job?

Ans. In summary:
Driver: Coordinates the job, creates a logical plan, and schedules tasks.

Executor: Runs the tasks and computes results.

Cluster Manager: Manages resources in the cluster.

RDD/DataFrame: Core abstractions for data.


Job Execution: The job goes through stages and tasks, possibly with shuffling, and
results are returned to the driver.
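
A minimal sketch of this flow (the data is illustrative): transformations only
build the plan on the driver, and the final action triggers stage creation and task
scheduling on the executors.

from pyspark.sql import SparkSession

# The SparkSession lives on the driver; the cluster manager assigns executors
spark = SparkSession.builder.appName("job_execution_demo").getOrCreate()

df = spark.range(1_000_000)

# Transformations only build the logical plan on the driver
doubled = df.selectExpr("id * 2 AS value")
grouped = doubled.groupBy((doubled.value % 10).alias("bucket")).count()

# The action triggers the job: the driver splits it into stages
# (a shuffle boundary before the aggregation) and schedules tasks on executors
grouped.show()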

Q6. How will you handle data skewness in Spark?

Ans. Data skewness can be handled in Spark by using techniques like partitioning,
bucketing, and broadcasting.
Partitioning the data based on a key column can distribute the data evenly across
the cluster.

Bucketing can further divide the data into smaller buckets based on a hash
function.

Broadcasting small tables can reduce the amount of data shuffled across the
network.

Dynamic allocation can also help by adding executors when there is a backlog of
pending tasks, giving long-running stages more resources.
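
A short hedged sketch combining two of these techniques (repartitioning the large
table and broadcasting the small one); the table paths and column names are
assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("skew_demo").getOrCreate()

events = spark.read.parquet("/data/events")    # large, skewed table (hypothetical)
users = spark.read.parquet("/data/users")      # small lookup table (hypothetical)

# Repartitioning spreads the large table across more tasks for parallelism
events_repart = events.repartition(200, "user_id")

# Broadcasting the small table removes the shuffle on the skewed join key entirely
joined = events_repart.join(broadcast(users), on="user_id", how="left")
joined.write.mode("overwrite").parquet("/data/events_enriched")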

Additional questions:

1) Difference between OLAP and OLTP, normalization and denormalization, UNION and
UNION ALL.

2) SCD types.

3) ETL pipeline: the source data arrives as a CSV table and the destination must
also be CSV, with no loading into SQL Server. Which transformation do you use?

4) How will you optimize a SQL query and a stored procedure?

Third Normal Form (3NF) is a stage of database normalization designed to reduce
redundancy and ensure that data dependencies are logically structured.

SCD (Slowly Changing Dimension) refers to handling changes in data over time in a
data warehouse. There are several types of SCDs:

SCD Type 1: Overwrite – When a record’s value changes, the old value is overwritten
with the new value. This is suitable for non-historical data where only the latest
value is needed.

SCD Type 2: History Tracking – When a record changes, a new row is inserted with
the new values, and the old row is retained with a version history. This allows
tracking of changes over time.

Additional columns like start_date, end_date, and current_flag are often used to
indicate the active record.

SCD Type 3: Limited History – This approach keeps the current value and the
previous value in the same row. It’s suitable when you need to track only a limited
number of previous values (usually 1 or 2).

SCD Type 4: Historical Table – Similar to Type 2, but historical data is stored in
a separate table to maintain the current record in the main table.

SCD Type 6: Hybrid – A combination of Type 1, Type 2, and Type 3, often used when
you need to keep both the latest data and the history of changes in one table.
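
A hedged PySpark sketch of the SCD Type 2 pattern described above (expire the
current row, insert the new version); the schema, keys, and load date are
illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("scd2_demo").getOrCreate()

# Current dimension rows (illustrative schema with SCD2 bookkeeping columns)
dim = spark.createDataFrame(
    [(1, "Alice", "London", "2023-01-01", None, True)],
    "cust_id INT, name STRING, city STRING, start_date STRING, end_date STRING, current_flag BOOLEAN",
)

# Incoming change: Alice moved to Paris
updates = spark.createDataFrame(
    [(1, "Alice", "Paris")], "cust_id INT, name STRING, city STRING"
)

load_date = "2024-01-01"   # assumed load date for the new versions

changed_keys = updates.select("cust_id")
matching = dim.join(changed_keys, "cust_id", "inner")

# 1) Expire the currently active rows for keys that changed
expired = (
    matching.filter(col("current_flag"))
            .withColumn("end_date", lit(load_date))
            .withColumn("current_flag", lit(False))
)
history = matching.filter(~col("current_flag"))           # older versions stay as-is
untouched = dim.join(changed_keys, "cust_id", "left_anti")

# 2) Insert the new versions as the current rows
new_rows = (
    updates.withColumn("start_date", lit(load_date))
           .withColumn("end_date", lit(None).cast("string"))
           .withColumn("current_flag", lit(True))
)

dim_updated = untouched.unionByName(history).unionByName(expired).unionByName(new_rows)
dim_updated.show()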

3) CSV-to-CSV transformation (no SQL Server load): a simple pandas script covers it.

import pandas as pd

# Load the data
df = pd.read_csv('source_file.csv')

# Perform transformations (example: remove rows with null values)
df_cleaned = df.dropna()

# Save the cleaned data to a new CSV file
df_cleaned.to_csv('destination_file.csv', index=False)

4) SQL Query and Stored Procedure Optimization


When optimizing SQL queries and stored procedures, the key goals are to reduce
query execution time and minimize resource consumption (e.g., CPU, memory, disk
I/O). Below are some tips:

SQL Query Optimization:


Indexes:

Use appropriate indexes (e.g., on frequently queried columns).

Be cautious about over-indexing, as it may slow down insert/update operations.

Avoid Select * (wildcard):

Always select only the necessary columns to reduce the amount of data being
transferred.

Use Joins Efficiently:

Prefer inner joins over outer joins when possible.

Make sure that join conditions are indexed.

Use WHERE Clause Effectively:

Filter records early using the WHERE clause to reduce the data set size.

Avoid Subqueries:

Subqueries can often be rewritten as joins for better performance.

Optimize Aggregations:

Use indexes and try to minimize the number of records being aggregated.

Batch Processing:

Use batch processing for large data sets to break the processing into smaller
chunks.

Stored Procedure Optimization:


Avoid Loops for Data Processing:

Whenever possible, avoid looping through records in stored procedures. Instead, use
set-based operations like JOIN or UPDATE.

Use Temporary Tables Wisely:

For large transformations, temporary tables can improve performance by breaking
down large operations into smaller, more manageable chunks.

Minimize Locking:

Optimize queries within stored procedures to minimize locking and deadlocks.


Proper Exception Handling:

Ensure efficient error handling to prevent unnecessary retries or rollbacks, which
can affect performance.

Analyze Execution Plans:

Use SQL Server’s execution plan to identify bottlenecks (e.g., scans, missing
indexes, etc.).

Final Architecture Overview:


Ingestion: Use Kafka for real-time data streaming.

Storage: Use a Data Lake (S3/HDFS) for raw data and NoSQL for real-time data.

Processing: Use Apache Spark Streaming for real-time processing and Airflow for
batch processing.

Transformation & Enrichment: Use Spark SQL or custom Python functions.

Data Quality & Monitoring: Use tools like Apache Deequ and Prometheus.

Delivery: Send processed data to dashboards or data warehouses.
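
A hedged Structured Streaming sketch of the ingestion-to-delivery path above (the
Kafka brokers, topic, and output paths are assumptions, and the spark-sql-kafka
connector package is assumed to be available on the cluster):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("realtime_pipeline_demo").getOrCreate()

# Ingestion: read a stream of events from Kafka (brokers/topic are assumptions)
raw = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "events")
         .load()
)

# Transformation: Kafka delivers key/value as binary, so cast before parsing
events = raw.select(col("value").cast("string").alias("json_payload"))

# Delivery: write the processed stream to the data lake (paths are assumptions)
query = (
    events.writeStream
          .format("parquet")
          .option("path", "/datalake/events")
          .option("checkpointLocation", "/datalake/checkpoints/events")
          .outputMode("append")
          .start()
)
query.awaitTermination()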

Block ++
Fault Tolerance:
What is fault tolerance in the context of distributed systems, and why is it
important?

Fault tolerance is the ability of a system to continue functioning properly even in
the event of failures (e.g., machine failure, network failure). In distributed
computing, fault tolerance is essential to ensure that the system remains available
and consistent despite component failures.

How does Spark handle faults in a distributed environment?

Spark uses RDD lineage to recompute lost data in case of a failure. Each RDD has a
record of the operations that created it, so Spark can recreate the lost partition
by applying the same operations to the data that led to it.

What happens if an executor or driver fails in Spark?

Executor failure: If an executor fails, the tasks assigned to it are reallocated to
other executors. The lost partitions can be recomputed from the RDD lineage if
necessary.

Driver failure: If the driver fails, the entire job fails. Spark requires the
driver to schedule tasks, manage the execution plan, and handle communication with
executors.

In Spark, the driver is responsible for splitting the job into tasks and
distributing them to different executors. The cluster manager (e.g., YARN, Mesos)
manages the resources and allocates nodes to executors. Tasks are executed in
parallel on the worker nodes.

4. Executor and Driver Failures:


What is the role of the driver in a Spark job?

The driver is responsible for the overall orchestration of a Spark job. It
schedules tasks, manages the DAG, and coordinates with the cluster manager to
allocate resources.

What is the role of an executor in Spark?

Executors are worker nodes in Spark that execute the tasks assigned to them by the
driver. Each executor runs in its own JVM and processes a portion of the data.

What is shuffling in Spark, and when does it occur?

Shuffling is the process of redistributing data across different nodes, typically
as a result of wide transformations like groupByKey, reduceByKey, join, etc. It
involves transferring data between nodes, which can be costly in terms of time and
resources.

Why is shuffling expensive, and how can we minimize it?

Shuffling involves network I/O and disk operations, which are slow and resource-
intensive. To minimize shuffling:

Use narrow transformations (e.g., map, filter) instead of wide transformations
(e.g., groupBy, join).

Use broadcast joins for small datasets.

Repartition or coalesce data strategically to avoid excessive shuffling.

6. Functional Data Engineering:


What is functional programming in the context of data engineering?

Functional programming focuses on immutability, first-class functions, and avoiding
side effects. In Spark, functional transformations like map, filter, and reduce are
commonly used. These operations are typically applied to immutable datasets (e.g.,
RDDs or DataFrames).

What are the benefits of using functional programming in data engineering?

It promotes cleaner, more modular, and maintainable code.

It enables parallel processing and better optimization.

It provides better error handling and debugging, as functions are often pure (i.e.,
they don’t change state outside their scope).
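
A small sketch of this functional style: pure functions passed to immutable
transformations (the input data is illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("functional_demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Pure functions: no shared state, no side effects, easy to run in parallel
doubled = rdd.map(lambda x: x * 2)            # returns a new RDD; rdd is unchanged
evens = doubled.filter(lambda x: x % 4 == 0)
total = evens.reduce(lambda a, b: a + b)

print(total)   # 4 + 8 = 12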

7. Best Practices in Distributed Data Engineering:


How can you optimize Spark jobs for better performance?

Use DataFrames and Datasets instead of RDDs for better optimization.

Minimize shuffling by avoiding wide transformations where possible.

Use partitioning strategies to ensure that data is distributed evenly across nodes.

Cache frequently used data to reduce recomputations.

Use broadcast joins for smaller datasets to avoid shuffling.


++
2. Different Types of Charts
Bar Chart:

Used for comparing categories of data. It displays data with rectangular bars, with
the length of each bar proportional to the value it represents.

Use case: Comparing sales performance across different months.

Line Chart:

Used to show trends over time. The data points are connected by straight lines.

Use case: Displaying stock market trends or temperature variations.

Pie Chart:

Used to represent data as a percentage of a whole. Each slice represents a
category's contribution to the total.

Use case: Market share distribution among competitors.

Scatter Plot:

Used to show relationships between two continuous variables. Each point represents
a pair of values.

Use case: Analyzing correlation between advertising budget and sales.

Histogram:

Used for showing the distribution of a single variable. It divides data into bins
and shows the frequency of values in each bin.

Use case: Distribution of age groups in a population survey.

Heatmap:

Displays data in matrix form with color coding to represent values.

Use case: Displaying correlations between multiple variables or showing website
user activity over time.

Area Chart:

Similar to line charts but with the area below the line filled in.

Use case: Cumulative sales over time.
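
A minimal matplotlib sketch of the first use case (comparing monthly sales); the
figures are made up for illustration:

import matplotlib.pyplot as plt

# Illustrative monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

plt.bar(months, sales)               # bar length is proportional to the value
plt.title("Sales by Month")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()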

Qdeep
Projection pushdown → only the name column is read.

Predicate pushdown → only rows where salary > 50000 are scanned.

If adaptive partition coalescing is enabled, Spark automatically merges small
shuffle partitions.

If adaptive join selection is enabled, Spark can replace a planned sort-merge
(shuffle) join with a broadcast or shuffled hash join at runtime.

Adaptive skew join detection splits skewed partitions into smaller tasks.
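
A hedged sketch of these behaviours (the Parquet path and columns are assumptions;
the configuration keys are standard Spark SQL settings):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("aqe_pushdown_demo").getOrCreate()

# Adaptive Query Execution settings referenced above
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

# With a columnar source such as Parquet, only the referenced column is read
# (projection pushdown) and the filter is pushed to the scan (predicate pushdown)
employees = spark.read.parquet("/data/employees")          # hypothetical path
high_paid_names = employees.filter(col("salary") > 50000).select("name")
high_paid_names.explain()   # the physical plan shows PushedFilters / ReadSchema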

Qsimple
How do you perform data transformation in Spark?
Answer: Data transformation in Spark can be performed using operations like map,
filter, reduce, groupBy, and join. These transformations can be applied to RDDs,
DataFrames, and Datasets to manipulate data.

The Catalyst Optimizer is a query optimization framework in Spark SQL that
automatically optimizes the logical and physical execution plans to improve query
performance.

Explain the concept of lazy evaluation in Spark.


Answer: Lazy evaluation means that Spark does not immediately execute
transformations on RDDs, DataFrames, or Datasets. Instead, it builds a logical plan
of the transformations and only executes them when an action (like collect or save)
is called. This optimization reduces the number of passes over the data.
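
A tiny sketch of lazy evaluation (the data is illustrative): nothing executes until
the action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

df = spark.range(1_000_000)

# Transformations only record the plan; no computation happens here
filtered = df.filter("id % 2 = 0")
projected = filtered.selectExpr("id * 10 AS value")

# The action triggers the whole optimized plan in a single pass
print(projected.count())   # 500000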

How do you manage Spark applications on Databricks clusters?


Answer: Spark applications on Databricks clusters can be managed by configuring
clusters (choosing instance types, auto-scaling options), monitoring cluster
performance, and using Databricks job scheduling to automate workflows.

What is the difference between wide and narrow transformations in Spark?


Answer: Narrow transformations (like map and filter) require no data movement,
since each output partition depends on a single input partition. Wide
transformations (like groupByKey and join) shuffle data across multiple partitions,
which is more resource-intensive.

How do you secure data and manage permissions in Databricks?


Answer: Data security and permissions can be managed using features like encryption
at rest and in transit, role-based access control (RBAC), secure cluster
configurations, and integration with AWS IAM or Azure Active Directory.

How do you use Databricks to process real-time data?


Answer: Real-time data processing in Databricks can be achieved using Spark
Streaming or Structured Streaming. These tools allow you to ingest, process, and
analyze streaming data from sources like Kafka, Kinesis, or Event Hubs.

Qhard
1. Lineage Graph and DAG Formation
In Apache Spark, a lineage graph represents the logical execution plan of a Spark
job and tracks the transformations applied to the RDDs (Resilient Distributed
Datasets). This graph helps in recovering lost data and optimizing execution. The
Directed Acyclic Graph (DAG) is a crucial concept in Spark's execution model and is
the backbone of how Spark manages and schedules tasks.

DAG (Directed Acyclic Graph) Formation:


DAG is a directed graph where each node represents a Spark operation (such as a
transformation), and the edges represent the data flow between operations.

DAG helps Spark optimize query execution by:

Breaking down the job into smaller tasks.


Enabling task scheduling in stages.

Determining dependencies between tasks.

When a Spark job is executed, Spark constructs a DAG of stages. Each stage consists
of operations that can be pipelined and executed in parallel. Narrow
transformations (e.g., map, filter) stay within a single stage, while wide
transformations (e.g., groupBy, join) introduce shuffle boundaries that separate
stages.

The lineage graph tracks the sequence of transformations that have been applied to
RDDs. If data is lost in an RDD partition, Spark can use this lineage graph to
recompute the missing data from the source.

Lineage in RDDs:
When you apply transformations to an RDD, Spark maintains the information about the
operations that created the current RDD. This is called the lineage of the RDD.

If an RDD partition is lost, the lineage graph helps Spark recompute the data from
the original source by reapplying the transformations to the original data.

Example:
Imagine you have a sequence of transformations like this:

rdd1 = sc.textFile("data.txt") # Read data
rdd2 = rdd1.filter(lambda x: "error" in x) # Filter rows with "error"
rdd3 = rdd2.map(lambda x: x.split()) # Split each row into words
In this case, Spark records the following lineage (all of these are narrow
transformations, so they are pipelined into a single stage):

Step 1: Read data (rdd1)

Step 2: Filter rows (rdd2)

Step 3: Map operation (rdd3)

The lineage of rdd3 will be the operations applied to rdd1 (i.e., filter and map),
which helps in data recomputation in case of failure.

2. RDDs Characteristics:
RDDs (Resilient Distributed Datasets) are a fundamental data structure in Spark,
providing an abstraction for working with distributed data in a fault-tolerant way.
Below are the key characteristics of RDDs:

Key Characteristics:
Immutable:

RDDs are immutable, meaning once created, they cannot be modified.

To perform transformations on an RDD, you need to create a new RDD from the
existing one (via transformations like map, filter, etc.).

Distributed:

RDDs are split into partitions that are distributed across the cluster. Each
partition is processed independently on different nodes in the cluster.
This allows Spark to take advantage of parallel processing.

Fault Tolerant:

RDDs are fault-tolerant due to lineage. If any partition of an RDD is lost, Spark
can recompute the lost data using the lineage information.

This is one of the key advantages of RDDs over other data structures.

Lazy Evaluation:

Transformations applied on RDDs are lazily evaluated. This means Spark does not
immediately execute operations until an action (like collect(), count(), or save())
is triggered.

This enables Spark to optimize the execution plan before running the jobs.

Parallel Processing:

Operations on RDDs are designed to be parallelized across multiple nodes in the
cluster. Spark automatically splits the RDD into smaller chunks (partitions), and
these partitions are processed concurrently by different nodes.

Type-Agnostic:

RDDs are type-agnostic. You can store any kind of data (text, numbers, objects) in
an RDD.

Transformations vs. Actions:

Transformations: These are operations that transform one RDD into another, such as
map(), filter(), flatMap(), groupBy(), etc. They are lazy operations.

Actions: These trigger the actual execution of the RDD operations and return
results, such as count(), collect(), reduce(), etc.

Narrow vs Wide Dependencies:

Narrow Dependencies: Involve transformations where each partition of the parent RDD
is only used by one partition of the child RDD (e.g., map, filter). These
operations can be executed in parallel.

Wide Dependencies: Involve transformations where multiple partitions of the parent
RDD are shuffled across partitions (e.g., groupByKey, reduceByKey). These require
more expensive shuffling and are usually more time-consuming.

Example of Narrow and Wide Dependencies:


rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# Narrow dependency (can be done in parallel)
rdd2 = rdd1.map(lambda x: x * 2)

# Wide dependency (requires shuffling): reduceByKey needs (key, value) pairs,
# so pair each element with a key first
rdd3 = rdd1.map(lambda x: (x % 2, x)).reduceByKey(lambda x, y: x + y)

map() is a narrow transformation, and it can be done in parallel.
reduceByKey() is a wide transformation, which requires shuffling data across
partitions.

Summary:
Lineage Graph: Tracks the transformations applied to an RDD. It helps Spark
recompute lost data and optimize execution by forming a DAG.

RDD Characteristics: RDDs are immutable, distributed, fault-tolerant, support lazy
evaluation, and provide parallel processing. RDDs use narrow and wide dependencies
to determine execution strategies.

Question: Write an SQL query to calculate the percentage difference between each
employee’s salary and the highest salary in their department.

Solution:

SELECT
    emp_id,
    emp_name,
    dept_id,
    salary,
    MAX(salary) OVER (PARTITION BY dept_id) AS highest_salary_in_dept,
    ROUND(((MAX(salary) OVER (PARTITION BY dept_id) - salary) /
           MAX(salary) OVER (PARTITION BY dept_id)) * 100, 2) AS percentage_difference
FROM
    employees;

Question: Identify the teams that played the most matches in a given season.

Solution:

SELECT
    team,
    seasonyear,
    COUNT(*) AS total_matches
FROM (
    SELECT team1 AS team, seasonyear FROM matches
    UNION ALL
    SELECT team2 AS team, seasonyear FROM matches
) AS all_matches
GROUP BY team, seasonyear
ORDER BY total_matches DESC, seasonyear;

PySpark Questions
1. Percentage Contribution to Department Salary
Question: Calculate the percentage contribution of each employee’s salary to the
total salary of their department.

Solution:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, round

# Initialize Spark session
spark = SparkSession.builder.appName("PercentageContribution").getOrCreate()

# Sample data
data = [
    (1, "Alice", "HR", 50000),
    (2, "Bob", "HR", 70000),
    (3, "Charlie", "IT", 90000),
    (4, "David", "IT", 60000),
    (5, "Eve", "Finance", 80000),
]
columns = ["emp_id", "name", "dept", "salary"]
df = spark.createDataFrame(data, schema=columns)

# Calculate total salary per department
dept_total_salary = df.groupBy("dept").agg(sum("salary").alias("total_salary"))

# Join the total salary with the original DataFrame
df_with_total = df.join(dept_total_salary, on="dept", how="inner")

# Calculate percentage contribution
result = df_with_total.withColumn(
    "percentage_contribution",
    round((col("salary") / col("total_salary")) * 100, 2)
)
result.show()
This PySpark code calculates the total salary per department, joins it with the
original DataFrame, and computes each employee’s percentage contribution.

Output:

+-------+------+-------+------+------------+-----------------------+
|   dept|emp_id|   name|salary|total_salary|percentage_contribution|
+-------+------+-------+------+------------+-----------------------+
|     HR|     1|  Alice| 50000|      120000|                  41.67|
|     HR|     2|    Bob| 70000|      120000|                  58.33|
|     IT|     3|Charlie| 90000|      150000|                   60.0|
|     IT|     4|  David| 60000|      150000|                   40.0|
|Finance|     5|    Eve| 80000|       80000|                  100.0|
+-------+------+-------+------+------------+-----------------------+
