
Q1. Difference between cache and persist, repartition and coalesce.

Ans. Cache and persist both keep a DataFrame/RDD around for reuse; cache uses the
default storage level, while persist lets you choose one. Repartition changes the
number of partitions with a full shuffle; coalesce only reduces partitions and
avoids a full shuffle.

Cache: Stores the DataFrame at the default storage level for faster access during
subsequent operations.

Persist: Same mechanism, but you choose the storage level explicitly (memory only,
memory and disk, disk only, etc.), trading speed for capacity and fault tolerance.

Repartition: Increases or decreases the number of partitions, shuffling data across
nodes.

Coalesce: Reduces the number of partitions without a full shuffle, which makes it
more efficient for decreasing partitions.

Example of Cache: df.cache() to speed up iterative algorithms.

Example of Persist: df.persist(StorageLevel.MEMORY_AND_DISK) for fault tolerance.

Example of Repartition: df.repartition(10) to increase partitions for parallel
processing.

Example of Coalesce: df.coalesce(2) to reduce partitions after filtering.
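
A minimal PySpark sketch tying these calls together (a sketch, assuming an existing
SparkSession; the data and output path are illustrative, not from the original
answer):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache_persist_demo").getOrCreate()
df = spark.range(1_000_000)

# cache(): default storage level, useful when the same DataFrame is reused
df.cache()
df.count()                                   # an action materializes the cache

# persist(): explicit storage level, spills to disk when memory is tight
filtered = df.filter("id % 2 = 0")
filtered.persist(StorageLevel.MEMORY_AND_DISK)

# repartition(): full shuffle, e.g. to increase parallelism before a heavy join
repartitioned = filtered.repartition(10)

# coalesce(): narrow, no full shuffle, e.g. to write fewer output files
repartitioned.coalesce(2).write.mode("overwrite").parquet("/tmp/coalesced_output")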

Q2. What are the key optimization techniques in Apache Spark?

Ans. Apache Spark optimization techniques improve performance by enabling efficient
data processing and better resource management.

Key optimization strategies include:

Use DataFrames and Datasets instead of RDDs, as they benefit from Catalyst
optimizer and Tungsten execution engine.

Leverage lazy evaluation so that only the necessary transformations are executed
when an action is triggered.

Apply partitioning to evenly distribute data across nodes:

Use repartition() to increase the number of partitions (e.g., before large joins).

Use coalesce() to reduce partitions without full data shuffle (e.g., before writing
to disk).

Minimize shuffling:

Prefer narrow transformations (like map, filter) where each partition depends only
on a single partition from the parent RDD.

Be cautious with wide transformations (like groupByKey, reduceByKey, join) that
require data to be shuffled across the network.

Broadcast smaller datasets:

Use broadcast joins to avoid costly shuffles when joining a large dataset with a
small one (broadcast(df)); a sketch appears at the end of this answer.

Use caching and persistence:

Cache frequently accessed intermediate results with df.cache() or df.persist() to
avoid recomputation and speed up iterative operations.

Types of Transformations:

Narrow Transformations: map(), filter(), union()

Wide Transformations: groupByKey(), reduceByKey(), join()

Shuffling:

Shuffling is the process of redistributing data across partitions, which is an
expensive operation. Minimizing unnecessary shuffles is crucial for Spark
performance optimization.
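
A hedged sketch of a few of these strategies together (broadcast join, caching, and
coalescing before a write); the file paths and column names are assumptions, not
part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization_demo").getOrCreate()

# Illustrative DataFrames: a large fact table and a small dimension table
orders = spark.read.parquet("/data/orders")          # hypothetical path
countries = spark.read.parquet("/data/countries")    # hypothetical small lookup table

# Broadcast join: ships the small table to every executor, avoiding a shuffle
enriched = orders.join(broadcast(countries), on="country_id", how="left")

# Cache an intermediate result that is reused by several downstream actions
enriched.cache()
enriched.count()                                     # materializes the cache
daily = enriched.groupBy("order_date").count()

# Coalesce before writing to avoid producing many tiny output files
daily.coalesce(4).write.mode("overwrite").parquet("/data/daily_counts")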

Q3. What are the different types of tables in Hive, and what are the differences
between them?
Ans. Hive supports two types of tables: Managed and External, each with distinct
data management and storage characteristics.
Managed Tables: Hive manages both the schema and the data. Dropping the table
deletes the data.

External Tables: Hive manages only the schema. Dropping the table does not delete
the data, which remains in the external storage.

Use Managed Tables for temporary data that can be recreated easily.

Use External Tables for data that is shared with other applications or needs to
persist beyond Hive.

Example of Managed Table: CREATE TABLE managed_table (id INT, name STRING);

Example of External Table: CREATE EXTERNAL TABLE external_table (id INT, name
STRING) LOCATION '/path/to/data';

Comparison of RDD, DataFrame, and Dataset:

Level of abstraction: RDD - Low; DataFrame - High; Dataset - Middle
Type safety: RDD - Yes; DataFrame - No; Dataset - Yes
Performance optimization: RDD - No; DataFrame - Yes (Catalyst); Dataset - Yes (Catalyst)
API style: RDD - Functional; DataFrame - SQL + Functional; Dataset - SQL + Functional
Best use case: RDD - Unstructured or complex transformations; DataFrame - Structured
data (CSV, JSON); Dataset - Structured data with type safety (Scala/Java only)

Task: Connect Spark to SQL Server (Azure SQL)

url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=myuser@myserver;password=mypassword;encrypt=true;"

df = spark.read.format("jdbc").option("url", url).option("dbtable", "my_table").load()

properties = {
    "user": "myuser@myserver",
    "password": "mypassword",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
df = spark.read.jdbc(url=url, table="my_table", properties=properties)

Make sure the Azure SQL firewall allows the Spark cluster's IP address; otherwise
the connection will be blocked.

Azure Active Directory (AAD) authentication can be configured instead of a plain
password for more secure authentication.

df.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "target_table") \
    .option("user", "username") \
    .option("password", "password") \
    .option("batchsize", "1000") \
    .save()

Q2. Write code for printing duplicate numbers in a list.


Ans. This code identifies and prints duplicate numbers from a given list using a
dictionary to track occurrences.
Use a dictionary to count occurrences of each number.

Iterate through the list and update the count in the dictionary.

Print numbers that have a count greater than 1.

Example: For the list [1, 2, 3, 2, 4, 3], the output should be 2 and 3.

def print_duplicates(nums):
    count = {}
    for num in nums:
        if num in count:
            count[num] += 1
        else:
            count[num] = 1

    for num, freq in count.items():
        if freq > 1:
            print(num)

# Example usage
nums = [1, 2, 3, 2, 4, 3]
print_duplicates(nums)
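
As a design note, the same result can be obtained with collections.Counter from the
standard library, which keeps the counting logic in one pass; a short equivalent
sketch:

from collections import Counter

def print_duplicates_counter(nums):
    # Counter builds the {number: occurrences} mapping in one pass
    for num, freq in Counter(nums).items():
        if freq > 1:
            print(num)

print_duplicates_counter([1, 2, 3, 2, 4, 3])   # prints 2 and 3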

Q5. What are the roles of the driver, executors, and the cluster manager in a Spark
job?

Ans. In summary:
Driver: Coordinates the job, creates a logical plan, and schedules tasks.

Executor: Runs the tasks and computes results.

Cluster Manager: Manages resources in the cluster.

RDD/DataFrame: Core abstractions for data.


Job Execution: The job goes through stages and tasks, possibly with shuffling, and
results are returned to the driver.
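
A minimal sketch of this flow (the data is illustrative): transformations only
build the plan on the driver, and the final action triggers stage creation and task
scheduling on the executors.

from pyspark.sql import SparkSession

# The SparkSession lives on the driver; the cluster manager assigns executors
spark = SparkSession.builder.appName("job_execution_demo").getOrCreate()

df = spark.range(1_000_000)

# Transformations only build the logical plan on the driver
doubled = df.selectExpr("id * 2 AS value")
grouped = doubled.groupBy((doubled.value % 10).alias("bucket")).count()

# The action triggers the job: the driver splits it into stages
# (a shuffle boundary before the aggregation) and schedules tasks on executors
grouped.show()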

Q6. How will you handle data skewness in Spark?

Ans. Data skewness can be handled in Spark by using techniques like partitioning,
bucketing, and broadcasting.
Partitioning the data based on a key column can distribute the data evenly across
the cluster.

Bucketing can further divide the data into smaller buckets based on a hash
function.

Broadcasting small tables can reduce the amount of data shuffled across the
network.

Dynamic allocation can also help by adding executors when there is a backlog of
pending tasks, giving long-running stages more resources.
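
A short hedged sketch combining two of these techniques (repartitioning the large
table and broadcasting the small one); the table paths and column names are
assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("skew_demo").getOrCreate()

events = spark.read.parquet("/data/events")    # large, skewed table (hypothetical)
users = spark.read.parquet("/data/users")      # small lookup table (hypothetical)

# Repartitioning spreads the large table across more tasks for parallelism
events_repart = events.repartition(200, "user_id")

# Broadcasting the small table removes the shuffle on the skewed join key entirely
joined = events_repart.join(broadcast(users), on="user_id", how="left")
joined.write.mode("overwrite").parquet("/data/events_enriched")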

Additional questions:

1) Difference between OLAP and OLTP, normalization and denormalization, UNION and
UNION ALL.

2) SCD types.

3) ETL pipeline: the source data arrives as a CSV table and the destination must
also be CSV, with no loading into SQL Server. Which transformation do you use?

4) How will you optimize a SQL query and a stored procedure?

Third Normal Form (3NF) is a stage of database normalization designed to reduce
redundancy and ensure that data dependencies are logically structured.

SCD (Slowly Changing Dimension) refers to handling changes in data over time in a
data warehouse. There are several types of SCDs:

SCD Type 1: Overwrite – When a record’s value changes, the old value is overwritten
with the new value. This is suitable for non-historical data where only the latest
value is needed.

SCD Type 2: History Tracking – When a record changes, a new row is inserted with
the new values, and the old row is retained with a version history. This allows
tracking of changes over time.

Additional columns like start_date, end_date, and current_flag are often used to
indicate the active record.

SCD Type 3: Limited History – This approach keeps the current value and the
previous value in the same row. It’s suitable when you need to track only a limited
number of previous values (usually 1 or 2).

SCD Type 4: Historical Table – Similar to Type 2, but historical data is stored in
a separate table to maintain the current record in the main table.

SCD Type 6: Hybrid – A combination of Type 1, Type 2, and Type 3, often used when
you need to keep both the latest data and the history of changes in one table.
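
A hedged PySpark sketch of the SCD Type 2 pattern described above (expire the
current row, insert the new version); the schema, keys, and load date are
illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("scd2_demo").getOrCreate()

# Current dimension rows (illustrative schema with SCD2 bookkeeping columns)
dim = spark.createDataFrame(
    [(1, "Alice", "London", "2023-01-01", None, True)],
    "cust_id INT, name STRING, city STRING, start_date STRING, end_date STRING, current_flag BOOLEAN",
)

# Incoming change: Alice moved to Paris
updates = spark.createDataFrame(
    [(1, "Alice", "Paris")], "cust_id INT, name STRING, city STRING"
)

load_date = "2024-01-01"   # assumed load date for the new versions

changed_keys = updates.select("cust_id")
matching = dim.join(changed_keys, "cust_id", "inner")

# 1) Expire the currently active rows for keys that changed
expired = (
    matching.filter(col("current_flag"))
            .withColumn("end_date", lit(load_date))
            .withColumn("current_flag", lit(False))
)
history = matching.filter(~col("current_flag"))           # older versions stay as-is
untouched = dim.join(changed_keys, "cust_id", "left_anti")

# 2) Insert the new versions as the current rows
new_rows = (
    updates.withColumn("start_date", lit(load_date))
           .withColumn("end_date", lit(None).cast("string"))
           .withColumn("current_flag", lit(True))
)

dim_updated = untouched.unionByName(history).unionByName(expired).unionByName(new_rows)
dim_updated.show()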

3) CSV-to-CSV transformation (no SQL Server load): a simple pandas script covers it.

import pandas as pd

# Load the data
df = pd.read_csv('source_file.csv')

# Perform transformations (example: remove rows with null values)
df_cleaned = df.dropna()

# Save the cleaned data to a new CSV file
df_cleaned.to_csv('destination_file.csv', index=False)

4) SQL Query and Stored Procedure Optimization


When optimizing SQL queries and stored procedures, the key goals are to reduce
query execution time and minimize resource consumption (e.g., CPU, memory, disk
I/O). Below are some tips:

SQL Query Optimization:


Indexes:

Use appropriate indexes (e.g., on frequently queried columns).

Be cautious about over-indexing, as it may slow down insert/update operations.

Avoid Select * (wildcard):

Always select only the necessary columns to reduce the amount of data being
transferred.

Use Joins Efficiently:

Prefer inner joins over outer joins when possible.

Make sure that join conditions are indexed.

Use WHERE Clause Effectively:

Filter records early using the WHERE clause to reduce the data set size.

Avoid Subqueries:

Subqueries can often be rewritten as joins for better performance.

Optimize Aggregations:

Use indexes and try to minimize the number of records being aggregated.

Batch Processing:

Use batch processing for large data sets to break the processing into smaller
chunks.

Stored Procedure Optimization:


Avoid Loops for Data Processing:

Whenever possible, avoid looping through records in stored procedures. Instead, use
set-based operations like JOIN or UPDATE.

Use Temporary Tables Wisely:

For large transformations, temporary tables can improve performance by breaking
down large operations into smaller, more manageable chunks.

Minimize Locking:

Optimize queries within stored procedures to minimize locking and deadlocks.


Proper Exception Handling:

Ensure efficient error handling to prevent unnecessary retries or rollbacks, which
can affect performance.

Analyze Execution Plans:

Use SQL Server’s execution plan to identify bottlenecks (e.g., scans, missing
indexes, etc.).

Final Architecture Overview:


Ingestion: Use Kafka for real-time data streaming.

Storage: Use a Data Lake (S3/HDFS) for raw data and NoSQL for real-time data.

Processing: Use Apache Spark Streaming for real-time processing and Airflow for
batch processing.

Transformation & Enrichment: Use Spark SQL or custom Python functions.

Data Quality & Monitoring: Use tools like Apache Deequ and Prometheus.

Delivery: Send processed data to dashboards or data warehouses.
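
A hedged Structured Streaming sketch of the ingestion-to-delivery path above (the
Kafka brokers, topic, and output paths are assumptions, and the spark-sql-kafka
connector package is assumed to be available on the cluster):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("realtime_pipeline_demo").getOrCreate()

# Ingestion: read a stream of events from Kafka (brokers/topic are assumptions)
raw = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "events")
         .load()
)

# Transformation: Kafka delivers key/value as binary, so cast before parsing
events = raw.select(col("value").cast("string").alias("json_payload"))

# Delivery: write the processed stream to the data lake (paths are assumptions)
query = (
    events.writeStream
          .format("parquet")
          .option("path", "/datalake/events")
          .option("checkpointLocation", "/datalake/checkpoints/events")
          .outputMode("append")
          .start()
)
query.awaitTermination()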

Block ++
Fault Tolerance:
What is fault tolerance in the context of distributed systems, and why is it
important?

Fault tolerance is the ability of a system to continue functioning properly even in
the event of failures (e.g., machine failure, network failure). In distributed
computing, fault tolerance is essential to ensure that the system remains available
and consistent despite component failures.

How does Spark handle faults in a distributed environment?

Spark uses RDD lineage to recompute lost data in case of a failure. Each RDD has a
record of the operations that created it, so Spark can recreate the lost partition
by applying the same operations to the data that led to it.

What happens if an executor or driver fails in Spark?

Executor failure: If an executor fails, the tasks assigned to it are reallocated to
other executors. The lost partitions can be recomputed from the RDD lineage if
necessary.

Driver failure: If the driver fails, the entire job fails. Spark requires the
driver to schedule tasks, manage the execution plan, and handle communication with
executors.

In Spark, the driver is responsible for splitting the job into tasks and
distributing them to different executors. The cluster manager (e.g., YARN, Mesos)
manages the resources and allocates nodes to executors. Tasks are executed in
parallel on the worker nodes.

4. Executor and Driver Failures:


What is the role of the driver in a Spark job?

The driver is responsible for the overall orchestration of a Spark job. It
schedules tasks, manages the DAG, and coordinates with the cluster manager to
allocate resources.

What is the role of an executor in Spark?

Executors are worker nodes in Spark that execute the tasks assigned to them by the
driver. Each executor runs in its own JVM and processes a portion of the data.

What is shuffling in Spark, and when does it occur?

Shuffling is the process of redistributing data across different nodes, typically
as a result of wide transformations like groupByKey, reduceByKey, join, etc. It
involves transferring data between nodes, which can be costly in terms of time and
resources.

Why is shuffling expensive, and how can we minimize it?

Shuffling involves network I/O and disk operations, which are slow and resource-
intensive. To minimize shuffling:

Use narrow transformations (e.g., map, filter) instead of wide transformations
(e.g., groupBy, join).

Use broadcast joins for small datasets.

Repartition or coalesce data strategically to avoid excessive shuffling.

6. Functional Data Engineering:


What is functional programming in the context of data engineering?

Functional programming focuses on immutability, first-class functions, and avoiding
side effects. In Spark, functional transformations like map, filter, and reduce are
commonly used. These operations are typically applied to immutable datasets (e.g.,
RDDs or DataFrames).

What are the benefits of using functional programming in data engineering?

It promotes cleaner, more modular, and maintainable code.

It enables parallel processing and better optimization.

It provides better error handling and debugging, as functions are often pure (i.e.,
they don’t change state outside their scope).
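
A small sketch of this functional style: pure functions passed to immutable
transformations (the input data is illustrative).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("functional_demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Pure functions: no shared state, no side effects, easy to run in parallel
doubled = rdd.map(lambda x: x * 2)            # returns a new RDD; rdd is unchanged
evens = doubled.filter(lambda x: x % 4 == 0)
total = evens.reduce(lambda a, b: a + b)

print(total)   # 4 + 8 = 12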

7. Best Practices in Distributed Data Engineering:


How can you optimize Spark jobs for better performance?

Use DataFrames and Datasets instead of RDDs for better optimization.

Minimize shuffling by avoiding wide transformations where possible.

Use partitioning strategies to ensure that data is distributed evenly across nodes.

Cache frequently used data to reduce recomputations.

Use broadcast joins for smaller datasets to avoid shuffling.


++
2. Different Types of Charts
Bar Chart:

Used for comparing categories of data. It displays data with rectangular bars, with
the length of each bar proportional to the value it represents.

Use case: Comparing sales performance across different months.

Line Chart:

Used to show trends over time. The data points are connected by straight lines.

Use case: Displaying stock market trends or temperature variations.

Pie Chart:

Used to represent data as a percentage of a whole. Each slice represents a
category's contribution to the total.

Use case: Market share distribution among competitors.

Scatter Plot:

Used to show relationships between two continuous variables. Each point represents
a pair of values.

Use case: Analyzing correlation between advertising budget and sales.

Histogram:

Used for showing the distribution of a single variable. It divides data into bins
and shows the frequency of values in each bin.

Use case: Distribution of age groups in a population survey.

Heatmap:

Displays data in matrix form with color coding to represent values.

Use case: Displaying correlations between multiple variables or showing website
user activity over time.

Area Chart:

Similar to line charts but with the area below the line filled in.

Use case: Cumulative sales over time.
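
A minimal matplotlib sketch of the first use case (comparing monthly sales); the
figures are made up for illustration:

import matplotlib.pyplot as plt

# Illustrative monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

plt.bar(months, sales)               # bar length is proportional to the value
plt.title("Sales by Month")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()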

Qdeep
Projection pushdown → only the name column is read.

Predicate pushdown → only rows where salary > 50000 are scanned.

If adaptive partition coalescing is enabled, Spark automatically merges small
shuffle partitions.

If adaptive join selection is enabled, Spark can replace a planned sort-merge
(shuffle) join with a broadcast or shuffled hash join at runtime.

Adaptive skew join detection splits skewed partitions into smaller tasks.
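
A hedged sketch of these behaviours (the Parquet path and columns are assumptions;
the configuration keys are standard Spark SQL settings):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("aqe_pushdown_demo").getOrCreate()

# Adaptive Query Execution settings referenced above
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

# With a columnar source such as Parquet, only the referenced column is read
# (projection pushdown) and the filter is pushed to the scan (predicate pushdown)
employees = spark.read.parquet("/data/employees")          # hypothetical path
high_paid_names = employees.filter(col("salary") > 50000).select("name")
high_paid_names.explain()   # the physical plan shows PushedFilters / ReadSchema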

Qsimple
How do you perform data transformation in Spark?
Answer: Data transformation in Spark can be performed using operations like map,
filter, reduce, groupBy, and join. These transformations can be applied to RDDs,
DataFrames, and Datasets to manipulate data.

The Catalyst Optimizer is a query optimization framework in Spark SQL that
automatically optimizes the logical and physical execution plans to improve query
performance.

Explain the concept of lazy evaluation in Spark.


Answer: Lazy evaluation means that Spark does not immediately execute
transformations on RDDs, DataFrames, or Datasets. Instead, it builds a logical plan
of the transformations and only executes them when an action (like collect or save)
is called. This optimization reduces the number of passes over the data.
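
A tiny sketch of lazy evaluation (the data is illustrative): nothing executes until
the action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

df = spark.range(1_000_000)

# Transformations only record the plan; no computation happens here
filtered = df.filter("id % 2 = 0")
projected = filtered.selectExpr("id * 10 AS value")

# The action triggers the whole optimized plan in a single pass
print(projected.count())   # 500000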

How do you manage Spark applications on Databricks clusters?


Answer: Spark applications on Databricks clusters can be managed by configuring
clusters (choosing instance types, auto-scaling options), monitoring cluster
performance, and using Databricks job scheduling to automate workflows.

What is the difference between wide and narrow transformations in Spark?


Answer: Narrow transformations (like map and filter) require no data movement,
since each output partition depends on a single input partition. Wide
transformations (like groupByKey and join) shuffle data across multiple partitions,
which is more resource-intensive.

How do you secure data and manage permissions in Databricks?


Answer: Data security and permissions can be managed using features like encryption
at rest and in transit, role-based access control (RBAC), secure cluster
configurations, and integration with AWS IAM or Azure Active Directory.

How do you use Databricks to process real-time data?


Answer: Real-time data processing in Databricks can be achieved using Spark
Streaming or Structured Streaming. These tools allow you to ingest, process, and
analyze streaming data from sources like Kafka, Kinesis, or Event Hubs.

Qhard
1. Lineage Graph and DAG Formation
In Apache Spark, a lineage graph represents the logical execution plan of a Spark
job and tracks the transformations applied to the RDDs (Resilient Distributed
Datasets). This graph helps in recovering lost data and optimizing execution. The
Directed Acyclic Graph (DAG) is a crucial concept in Spark's execution model and is
the backbone of how Spark manages and schedules tasks.

DAG (Directed Acyclic Graph) Formation:


DAG is a directed graph where each node represents a Spark operation (such as a
transformation), and the edges represent the data flow between operations.

DAG helps Spark optimize query execution by:

Breaking down the job into smaller tasks.


Enabling task scheduling in stages.

Determining dependencies between tasks.

When a Spark job is executed, Spark constructs a DAG of stages. Each stage consists
of operations that can be pipelined and executed in parallel. Narrow
transformations (e.g., map, filter) stay within a single stage, while wide
transformations (e.g., groupBy, join) introduce shuffle boundaries that separate
stages.

The lineage graph tracks the sequence of transformations that have been applied to
RDDs. If data is lost in an RDD partition, Spark can use this lineage graph to
recompute the missing data from the source.

Lineage in RDDs:
When you apply transformations to an RDD, Spark maintains the information about the
operations that created the current RDD. This is called the lineage of the RDD.

If an RDD partition is lost, the lineage graph helps Spark recompute the data from
the original source by reapplying the transformations to the original data.

Example:
Imagine you have a sequence of transformations like this:

rdd1 = sc.textFile("data.txt") # Read data
rdd2 = rdd1.filter(lambda x: "error" in x) # Filter rows with "error"
rdd3 = rdd2.map(lambda x: x.split()) # Split each row into words
In this case, Spark records the following lineage (all of these are narrow
transformations, so they are pipelined into a single stage):

Step 1: Read data (rdd1)

Step 2: Filter rows (rdd2)

Step 3: Map operation (rdd3)

The lineage of rdd3 will be the operations applied to rdd1 (i.e., filter and map),
which helps in data recomputation in case of failure.

2. RDDs Characteristics:
RDDs (Resilient Distributed Datasets) are a fundamental data structure in Spark,
providing an abstraction for working with distributed data in a fault-tolerant way.
Below are the key characteristics of RDDs:

Key Characteristics:
Immutable:

RDDs are immutable, meaning once created, they cannot be modified.

To perform transformations on an RDD, you need to create a new RDD from the
existing one (via transformations like map, filter, etc.).

Distributed:

RDDs are split into partitions that are distributed across the cluster. Each
partition is processed independently on different nodes in the cluster.
This allows Spark to take advantage of parallel processing.

Fault Tolerant:

RDDs are fault-tolerant due to lineage. If any partition of an RDD is lost, Spark
can recompute the lost data using the lineage information.

This is one of the key advantages of RDDs over other data structures.

Lazy Evaluation:

Transformations applied on RDDs are lazily evaluated. This means Spark does not
immediately execute operations until an action (like collect(), count(), or save())
is triggered.

This enables Spark to optimize the execution plan before running the jobs.

Parallel Processing:

Operations on RDDs are designed to be parallelized across multiple nodes in the
cluster. Spark automatically splits the RDD into smaller chunks (partitions), and
these partitions are processed concurrently by different nodes.

Type-Agnostic:

RDDs are type-agnostic. You can store any kind of data (text, numbers, objects) in
an RDD.

Transformations vs. Actions:

Transformations: These are operations that transform one RDD into another, such as
map(), filter(), flatMap(), groupBy(), etc. They are lazy operations.

Actions: These trigger the actual execution of the RDD operations and return
results, such as count(), collect(), reduce(), etc.

Narrow vs Wide Dependencies:

Narrow Dependencies: Involve transformations where each partition of the parent RDD
is only used by one partition of the child RDD (e.g., map, filter). These
operations can be executed in parallel.

Wide Dependencies: Involve transformations where multiple partitions of the parent
RDD are shuffled across partitions (e.g., groupByKey, reduceByKey). These require
more expensive shuffling and are usually more time-consuming.

Example of Narrow and Wide Dependencies:


rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# Narrow dependency (can be done in parallel)
rdd2 = rdd1.map(lambda x: x * 2)

# Wide dependency (requires shuffling): reduceByKey needs (key, value) pairs,
# so pair each element with a key first
rdd3 = rdd1.map(lambda x: (x % 2, x)).reduceByKey(lambda x, y: x + y)

map() is a narrow transformation, and it can be done in parallel.
reduceByKey() is a wide transformation, which requires shuffling data across
partitions.

Summary:
Lineage Graph: Tracks the transformations applied to an RDD. It helps Spark
recompute lost data and optimize execution by forming a DAG.

RDD Characteristics: RDDs are immutable, distributed, fault-tolerant, support lazy
evaluation, and provide parallel processing. RDDs use narrow and wide dependencies
to determine execution strategies.

Question: Write an SQL query to calculate the percentage difference between each
employee’s salary and the highest salary in their department.

Solution:

SELECT
    emp_id,
    emp_name,
    dept_id,
    salary,
    MAX(salary) OVER (PARTITION BY dept_id) AS highest_salary_in_dept,
    ROUND(((MAX(salary) OVER (PARTITION BY dept_id) - salary) /
           MAX(salary) OVER (PARTITION BY dept_id)) * 100, 2) AS percentage_difference
FROM
    employees;

Question: Identify the teams that played the most matches in a given season.

Solution:

SELECT
    team,
    seasonyear,
    COUNT(*) AS total_matches
FROM (
    SELECT team1 AS team, seasonyear FROM matches
    UNION ALL
    SELECT team2 AS team, seasonyear FROM matches
) AS all_matches
GROUP BY team, seasonyear
ORDER BY total_matches DESC, seasonyear;

PySpark Questions
1. Percentage Contribution to Department Salary
Question: Calculate the percentage contribution of each employee’s salary to the
total salary of their department.

Solution:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, round

# Initialize Spark session
spark = SparkSession.builder.appName("PercentageContribution").getOrCreate()

# Sample data
data = [
    (1, "Alice", "HR", 50000),
    (2, "Bob", "HR", 70000),
    (3, "Charlie", "IT", 90000),
    (4, "David", "IT", 60000),
    (5, "Eve", "Finance", 80000),
]
columns = ["emp_id", "name", "dept", "salary"]
df = spark.createDataFrame(data, schema=columns)

# Calculate total salary per department
dept_total_salary = df.groupBy("dept").agg(sum("salary").alias("total_salary"))

# Join the total salary with the original DataFrame
df_with_total = df.join(dept_total_salary, on="dept", how="inner")

# Calculate percentage contribution
result = df_with_total.withColumn(
    "percentage_contribution",
    round((col("salary") / col("total_salary")) * 100, 2)
)
result.show()
This PySpark code calculates the total salary per department, joins it with the
original DataFrame, and computes each employee’s percentage contribution.

Output:

+-------+------+-------+------+------------+-----------------------+
|   dept|emp_id|   name|salary|total_salary|percentage_contribution|
+-------+------+-------+------+------------+-----------------------+
|     HR|     1|  Alice| 50000|      120000|                  41.67|
|     HR|     2|    Bob| 70000|      120000|                  58.33|
|     IT|     3|Charlie| 90000|      150000|                   60.0|
|     IT|     4|  David| 60000|      150000|                   40.0|
|Finance|     5|    Eve| 80000|       80000|                  100.0|
+-------+------+-------+------+------------+-----------------------+
