Q1. Difference Between Cache and Persist, Repartition and Coalesce
Ans. Cache stores data in memory for quick access, while persist saves it to disk.
Repartition changes data distribution; coalesce reduces partitions.
Cache: Stores the DataFrame in memory for faster access during subsequent operations (shorthand for persist with the default storage level).
Persist: Lets you choose a storage level (memory only, memory and disk, or disk only); spilling to disk adds resilience under memory pressure but is slower than a pure in-memory cache.
Repartition: Changes the number of partitions with a full shuffle, useful for increasing parallelism.
Coalesce: Reduces the number of partitions without a full shuffle, which makes it more efficient for decreasing partition counts.
A short sketch of all four operations follows below.
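A minimal PySpark sketch of these four operations, assuming nothing beyond a local SparkSession; the data, partition counts, and column expression are illustrative:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
df = spark.range(1000000)                    # illustrative data

df.cache()                                   # keep in memory with the default storage level
df2 = df.selectExpr("id * 2 AS doubled")
df2.persist(StorageLevel.MEMORY_AND_DISK)    # explicit storage level, may spill to disk

wide = df2.repartition(200)   # full shuffle: more partitions, more parallelism
narrow = wide.coalesce(10)    # no full shuffle: fewer partitions, e.g. before a write

df.unpersist()
df2.unpersist()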
Q2. What are some Apache Spark optimization techniques?
Apache Spark optimization techniques improve performance by enabling efficient data
processing and better resource management.
Use DataFrames and Datasets instead of RDDs, as they benefit from Catalyst
optimizer and Tungsten execution engine.
Leverage lazy evaluation: transformations are only computed when an action runs, so Spark can skip unnecessary work and optimize the whole plan before execution.
Use repartition() to increase the number of partitions (e.g., before large joins).
Use coalesce() to reduce partitions without full data shuffle (e.g., before writing
to disk).
Minimize shuffling:
Prefer narrow transformations (like map, filter) where each partition depends only
on a single partition from the parent RDD.
Use broadcast joins to avoid costly shuffles when joining a large dataset with a small one (broadcast(df)), as sketched below.
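A hedged sketch of a broadcast join in PySpark; the table names, paths, and join key are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact table (illustrative path)
countries = spark.read.parquet("/data/countries")  # small dimension table

# Broadcasting the small side ships it to every executor,
# so the large side is joined locally without a shuffle.
joined = orders.join(broadcast(countries), on="country_code", how="left")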
Q3. What are the different types of tables in Hive, and what are the differences
between them?
Ans. Hive supports two types of tables: Managed and External, each with distinct
data management and storage characteristics.
Managed Tables: Hive manages both the schema and the data. Dropping the table
deletes the data.
External Tables: Hive manages only the schema. Dropping the table does not delete
the data, which remains in the external storage.
Use Managed Tables for temporary data that can be recreated easily.
Use External Tables for data that is shared with other applications or needs to
persist beyond Hive.
Example of a managed table: CREATE TABLE managed_table (id INT, name STRING);
Example of an external table: CREATE EXTERNAL TABLE external_table (id INT, name STRING) LOCATION '/path/to/data';
SQL Server
url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=myuser@myserver;password=mypassword;encrypt=true;"
df = spark.read.format("jdbc").option("url", url).option("dbtable", "my_table").load()
properties = {
    "user": "myuser@myserver",
    "password": "mypassword",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=url, table="my_table", properties=properties)
Make sure the Azure SQL firewall allows the IP address of the Spark cluster (otherwise the connection will be blocked).
You can configure Azure Active Directory (AAD) authentication instead of a plain password for stronger security.
df.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "target_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("batchsize", "1000") \
    .save()
Q4. Write a Python function that prints the duplicate elements in a list.
Ans. Iterate through the list, keep a count per element in a dictionary, then print the elements whose count is greater than one.
Example: for the list [1, 2, 3, 2, 4, 3], the output should be 2 and 3.
def print_duplicates(nums):
    count = {}
    for n in nums:
        count[n] = count.get(n, 0) + 1
    for n, c in count.items():
        if c > 1:
            print(n)

# Example usage
nums = [1, 2, 3, 2, 4, 3]
print_duplicates(nums)  # prints 2 and 3
Q5
Summary:
Driver: Coordinates the job, creates a logical plan, and schedules tasks.
Bucketing can further divide the data into smaller buckets based on a hash
function.
Broadcasting small tables can reduce the amount of data shuffled across the
network.
Enabling dynamic allocation lets Spark scale the number of executors up or down with the workload, which also helps when skewed tasks run long (see the bucketing sketch below).
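A sketch of bucketing a skewed table when writing it out, assuming df holds the table, a Hive-compatible metastore is available, and the bucket count (64), key (user_id), and table name (sales_bucketed) are illustrative:

# Bucket by the join key when writing, so later joins on user_id can avoid a full shuffle.
(df.write
    .bucketBy(64, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("sales_bucketed"))

# Dynamic allocation is enabled at submit time rather than in code,
# e.g. spark-submit --conf spark.dynamicAllocation.enabled=true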
Additional Questions
1) Difference between OLAP and OLTP, normalization and denormalization, UNION and UNION ALL.
2) SCD types.
3) ETL pipeline: the source data arrives as CSV and the destination must also be CSV, with no loading into SQL Server. Which transformation do you use?
4) How will you optimize a SQL query and a stored procedure?
SCD (Slowly Changing Dimension) refers to handling changes in data over time in a
data warehouse. There are several types of SCDs:
SCD Type 1: Overwrite – When a record’s value changes, the old value is overwritten
with the new value. This is suitable for non-historical data where only the latest
value is needed.
SCD Type 2: History Tracking – When a record changes, a new row is inserted with
the new values, and the old row is retained with a version history. This allows
tracking of changes over time.
Additional columns like start_date, end_date, and current_flag are often used to
indicate the active record.
SCD Type 3: Limited History – This approach keeps the current value and the
previous value in the same row. It’s suitable when you need to track only a limited
number of previous values (usually 1 or 2).
SCD Type 4: Historical Table – Similar to Type 2, but historical data is stored in
a separate table to maintain the current record in the main table.
SCD Type 6: Hybrid – A combination of Type 1, Type 2, and Type 3, often used when
you need to keep both the latest data and the history of changes in one table.
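A hedged PySpark sketch of an SCD Type 2 update. The DataFrames dim_customer (current dimension with start_date, end_date, current_flag) and updates (incoming changes keyed by customer_id, carrying the same business columns as the dimension), and the tracked attribute address, are all illustrative:

from pyspark.sql import functions as F

# Current dimension rows whose tracked attribute changed in this batch
changed = (dim_customer.alias("d")
    .join(updates.alias("u"), F.col("d.customer_id") == F.col("u.customer_id"))
    .where(F.col("d.current_flag") & (F.col("d.address") != F.col("u.address"))))

# 1. Expire the old versions
expired = (changed.select("d.*")
    .withColumn("end_date", F.current_date())
    .withColumn("current_flag", F.lit(False)))

# 2. Build the new current versions from the incoming rows
new_rows = (changed.select("u.*")
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("current_flag", F.lit(True)))

# 3. Keep every row except the ones being expired (customer_id + start_date
#    is assumed to uniquely identify a version), then rebuild the dimension
keep = dim_customer.join(expired.select("customer_id", "start_date"),
                         on=["customer_id", "start_date"], how="left_anti")
dim_customer_v2 = keep.unionByName(expired).unionByName(new_rows)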
SQL query and stored procedure optimization:
Select only the necessary columns to reduce the amount of data being transferred.
Filter records early using the WHERE clause to reduce the data set size.
Avoid Subqueries:
Rewrite correlated subqueries as joins where possible; they are often easier for the optimizer to handle.
Optimize Aggregations:
Use indexes and try to minimize the number of records being aggregated.
Batch Processing:
Use batch processing for large data sets to break the processing into smaller
chunks.
Whenever possible, avoid looping through records in stored procedures. Instead, use
set-based operations like JOIN or UPDATE.
Minimize Locking:
Keep transactions short and use appropriate isolation levels to reduce blocking.
Check Execution Plans:
Use SQL Server's execution plan to identify bottlenecks (e.g., scans, missing indexes, etc.).
Storage: Use a Data Lake (S3/HDFS) for raw data and NoSQL for real-time data.
Processing: Use Apache Spark Streaming for real-time processing and Airflow for
batch processing.
Data Quality & Monitoring: Use tools like Apache Deequ and Prometheus.
Block ++
Fault Tolerance:
What is fault tolerance in the context of distributed systems, and why is it
important?
Spark uses RDD lineage to recompute lost data in case of a failure. Each RDD has a
record of the operations that created it, so Spark can recreate the lost partition
by applying the same operations to the data that led to it.
Driver failure: If the driver fails, the entire job fails. Spark requires the
driver to schedule tasks, manage the execution plan, and handle communication with
executors.
In Spark, the driver is responsible for splitting the job into tasks and
distributing them to different executors. The cluster manager (e.g., YARN, Mesos)
manages the resources and allocates nodes to executors. Tasks are executed in
parallel on the worker nodes.
Executors are worker nodes in Spark that execute the tasks assigned to them by the
driver. Each executor runs in its own JVM and processes a portion of the data.
Shuffling involves network I/O and disk operations, which are slow and resource-intensive. To minimize shuffling, prefer narrow transformations, broadcast small tables, and use partitioning strategies that distribute data evenly across nodes.
Spark's functional programming style also provides better error handling and debugging, since functions are often pure (i.e., they don't change state outside their scope).
Common chart types:
Bar Chart:
Used for comparing categories of data. It displays data with rectangular bars, with the length of each bar proportional to the value it represents.
Line Chart:
Used to show trends over time. The data points are connected by straight lines.
Pie Chart:
Used to show the proportion of each category relative to the whole.
Scatter Plot:
Used to show relationships between two continuous variables. Each point represents
a pair of values.
Histogram:
Used for showing the distribution of a single variable. It divides data into bins
and shows the frequency of values in each bin.
Heatmap:
Used to show the magnitude of values across two dimensions using color intensity.
Area Chart:
Similar to line charts but with the area below the line filled in.
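A minimal matplotlib sketch of two of these chart types; matplotlib is assumed to be available and the data is made up:

import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [10, 24, 17]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, values)                  # bar chart: compare categories
ax1.set_title("Bar chart")
ax2.hist([1, 2, 2, 3, 3, 3, 4, 5], bins=5)   # histogram: distribution of one variable
ax2.set_title("Histogram")
plt.show()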
Qdeep
Projection pushdown → only the name column is read from storage: with columnar formats like Parquet, Spark skips the other columns entirely when the query selects just name.
Qsimple
How do you perform data transformation in Spark?
Answer: Data transformation in Spark can be performed using operations like map,
filter, reduce, groupBy, and join. These transformations can be applied to RDDs,
DataFrames, and Datasets to manipulate data.
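A small PySpark sketch of these transformations on a DataFrame; the data and column names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transformations").getOrCreate()
df = spark.createDataFrame(
    [("Alice", "HR", 50000), ("Bob", "HR", 70000), ("Charlie", "IT", 90000)],
    ["name", "dept", "salary"])

filtered = df.filter(F.col("salary") > 60000)                       # filter
renamed = filtered.withColumn("salary_k", F.col("salary") / 1000)   # map-like column transform
by_dept = df.groupBy("dept").agg(F.sum("salary").alias("total"))    # groupBy + aggregate
joined = df.join(by_dept, "dept")                                   # join
joined.show()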
Qhard
1. Lineage Graph and DAG Formation
In Apache Spark, a lineage graph represents the logical execution plan of a Spark
job and tracks the transformations applied to the RDDs (Resilient Distributed
Datasets). This graph helps in recovering lost data and optimizing execution. The
Directed Acyclic Graph (DAG) is a crucial concept in Spark's execution model and is
the backbone of how Spark manages and schedules tasks.
When a Spark job is executed, Spark constructs a DAG of stages. Each stage consists
of operations that can be executed in parallel. The stages are linked through
dependencies, which are typically defined by narrow transformations (e.g., map,
filter) and wide transformations (e.g., groupBy, join).
The lineage graph tracks the sequence of transformations that have been applied to
RDDs. If data is lost in an RDD partition, Spark can use this lineage graph to
recompute the missing data from the source.
Lineage in RDDs:
When you apply transformations to an RDD, Spark maintains the information about the
operations that created the current RDD. This is called the lineage of the RDD.
If an RDD partition is lost, the lineage graph helps Spark recompute the data from
the original source by reapplying the transformations to the original data.
Example:
Imagine you have a sequence of transformations like this:
rdd1 = sc.textFile("data.txt") # Read data
rdd2 = rdd1.filter(lambda x: "error" in x) # Filter rows with "error"
rdd3 = rdd2.map(lambda x: x.split()) # Split each row into words
In this case, all three operations (textFile, filter, map) are narrow transformations, so Spark places them in a single stage of the DAG.
The lineage of rdd3 records the operations applied to rdd1 (i.e., filter and map), which lets Spark recompute its partitions from the source data in case of failure.
2. RDDs Characteristics:
RDDs (Resilient Distributed Datasets) are a fundamental data structure in Spark,
providing an abstraction for working with distributed data in a fault-tolerant way.
Below are the key characteristics of RDDs:
Key Characteristics:
Immutable:
Once created, an RDD cannot be modified. To transform the data, you create a new RDD from the existing one (via transformations like map, filter, etc.).
Distributed:
RDDs are split into partitions that are distributed across the cluster. Each
partition is processed independently on different nodes in the cluster.
This allows Spark to take advantage of parallel processing.
Fault Tolerant:
RDDs are fault-tolerant due to lineage. If any partition of an RDD is lost, Spark
can recompute the lost data using the lineage information.
This is one of the key advantages of RDDs over other data structures.
Lazy Evaluation:
Transformations applied on RDDs are lazily evaluated. This means Spark does not
immediately execute operations until an action (like collect(), count(), or save())
is triggered.
This enables Spark to optimize the execution plan before running the jobs.
Parallel Processing:
Operations on an RDD run in parallel across its partitions, with one task per partition.
Type-Agnostic:
RDDs are type-agnostic. You can store any kind of data (text, numbers, objects) in
an RDD.
Transformations: These are operations that transform one RDD into another, such as
map(), filter(), flatMap(), groupBy(), etc. They are lazy operations.
Actions: These trigger the actual execution of the RDD operations and return
results, such as count(), collect(), reduce(), etc.
Narrow Dependencies: Involve transformations where each partition of the parent RDD is used by only one partition of the child RDD (e.g., map, filter). These operations can be executed in parallel without a shuffle.
Wide Dependencies: Involve transformations where a partition of the child RDD depends on multiple parent partitions (e.g., groupBy, join), which requires a shuffle.
A small sketch of lazy transformations versus actions follows below.
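A minimal sketch of lazy transformations versus actions, assuming an existing SparkSession named spark; the data is illustrative:

rdd = spark.sparkContext.parallelize(["info ok", "error disk", "error net"])

# Transformations: lazy, nothing runs yet
errors = rdd.filter(lambda line: "error" in line)
words = errors.map(lambda line: line.split())

# Action: triggers execution of the whole lineage above
print(words.count())  # 2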
Summary:
Lineage Graph: Tracks the transformations applied to an RDD. It helps Spark
recompute lost data and optimize execution by forming a DAG.
Question: Write an SQL query to calculate the percentage difference between each
employee’s salary and the highest salary in their department.
Solution:
SELECT
    emp_id,
    emp_name,
    dept_id,
    salary,
    MAX(salary) OVER (PARTITION BY dept_id) AS highest_salary_in_dept,
    ROUND(((MAX(salary) OVER (PARTITION BY dept_id) - salary)
           / MAX(salary) OVER (PARTITION BY dept_id)) * 100, 2) AS percentage_difference
FROM
    employees;
Question: Identify the teams that played the most matches in a given season.
Solution:
SELECT
team,
seasonyear,
COUNT(*) AS total_matches
FROM (
SELECT team1 AS team, seasonyear FROM matches
UNION ALL
SELECT team2 AS team, seasonyear FROM matches
) AS all_matches
GROUP BY team, seasonyear
ORDER BY total_matches DESC, seasonyear;
PySpark Questions
1. Percentage Contribution to Department Salary
Question: Calculate the percentage contribution of each employee’s salary to the
total salary of their department.
Solution:
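The original solution code is not shown here; below is a possible PySpark sketch using a window over dept, assuming a DataFrame df with the columns that appear in the output (dept, emp_id, name, salary):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

dept_window = Window.partitionBy("dept")

result = (df
    .withColumn("total_salary", F.sum("salary").over(dept_window))
    .withColumn("percentage_contribution",
                F.round(F.col("salary") / F.col("total_salary") * 100, 2)))
result.orderBy("dept", "emp_id").show()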
Output:
+-------+------+-------+------+------------+-----------------------+
|   dept|emp_id|   name|salary|total_salary|percentage_contribution|
+-------+------+-------+------+------------+-----------------------+
|     HR|     1|  Alice| 50000|      120000|                  41.67|
|     HR|     2|    Bob| 70000|      120000|                  58.33|
|     IT|     3|Charlie| 90000|      150000|                  60.00|
|     IT|     4|  David| 60000|      150000|                  40.00|
|Finance|     5|    Eve| 80000|       80000|                 100.00|
+-------+------+-------+------+------------+-----------------------+