LT Mindtree

The document outlines various SQL conditions, methods for handling missing data in PySpark, and the backend processes involved when submitting a Spark job in Databricks. It also discusses query acceleration techniques, ways to delete duplicate records, performance optimization in Spark, data transfer between dashboards, and hands-on experience with big data tools. Additionally, it describes the SSO process between Snowflake and Azure Active Directory, emphasizing SAML-based authentication and trust relationships.

1) What conditions are used in SQL?

Ans. SQL conditions are used to filter data based on specified criteria.
Common conditions include WHERE, AND, OR, IN, BETWEEN, LIKE, etc.

Conditions restrict which rows a query returns or affects.

Examples: WHERE salary > 50000, AND department = 'IT', OR age < 30
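For illustration, a minimal PySpark sketch (the employees data, column names, and
values are hypothetical) showing these conditions in a query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, for illustration only
employees = spark.createDataFrame(
    [("Asha", "IT", 60000, 28), ("Ravi", "HR", 45000, 35)],
    ["name", "department", "salary", "age"],
)
employees.createOrReplaceTempView("employees")

# WHERE combined with AND/OR
spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE (salary > 50000 AND department = 'IT')
       OR age < 30
""").show()

# IN and BETWEEN conditions
spark.sql("""
    SELECT name
    FROM employees
    WHERE department IN ('IT', 'HR')
      AND salary BETWEEN 40000 AND 70000
""").show()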

2) How do you handle missing data in a PySpark DataFrame?

Ans. Handle missing data in a PySpark DataFrame using functions like dropna(),
fillna(), or replace().
Use dropna() to remove rows that contain missing values

Use fillna() to fill missing values with a specified value

Use replace() to swap specific values such as NaN or sentinel values (nulls
themselves are handled by dropna() or fillna())
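A minimal sketch of these three functions, assuming a small hypothetical DataFrame
with nulls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with missing values, for illustration
df = spark.createDataFrame(
    [("Asha", 28), (None, 35), ("Ravi", None)],
    ["name", "age"],
)

df.dropna().show()                                  # drop rows with any null
df.fillna({"name": "unknown", "age": 0}).show()     # fill nulls per column
df.replace("Ravi", "R. Kumar", "name").show()       # swap a specific value (not nulls)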

3) In Databricks, when a Spark job is submitted, what happens at the backend? Explain
the flow.

Ans. When a Spark job is submitted in Databricks, several backend processes are
triggered to execute it.
The submitted Spark job is divided into tasks by the Spark driver.

The tasks are then scheduled to run on the available worker nodes in the cluster.

The worker nodes execute the tasks and return the results to the driver.

The driver aggregates the results and presents them to the user.

Various optimizations such as data shuffling and caching may be applied during the
execution process.
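A generic (not Databricks-specific) sketch of what triggers this flow: transformations
stay lazy, and the action at the end is what makes the driver build the DAG, split it
into stages and tasks, and schedule them on executors:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)                        # lazily defined, nothing runs yet
agg = (df.withColumn("bucket", F.col("id") % 10)
         .groupBy("bucket").count())               # still lazy; adds a shuffle boundary

# The action below makes the driver build the DAG, split it into stages at the
# shuffle, schedule tasks on the executors, and collect the results back.
agg.show()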

4) How does query acceleration speed up query processing?


Ans. Query acceleration speeds up query processing by optimizing query execution
and reducing the time taken to retrieve data.
Query acceleration uses techniques like indexing, partitioning, and caching to
optimize query execution.

It reduces the time taken to retrieve data by minimizing disk I/O and utilizing in-
memory processing.

Examples include using columnar storage formats like Parquet or optimizing join
operations.
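A hedged sketch of partitioned Parquet storage plus caching in PySpark (the data,
output path, and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical events data
events = spark.createDataFrame(
    [("2024-01-01", "click", 3), ("2024-01-02", "view", 7)],
    ["event_date", "event_type", "value"],
)

# Columnar, partitioned storage: queries filtering on event_date can skip whole
# partitions and read only the columns they need.
events.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_parquet")

fast = spark.read.parquet("/tmp/events_parquet")
fast.cache()                                        # keep hot data in memory
fast.filter("event_date = '2024-01-01'").count()    # partition pruning + in-memory reuse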

Q5. How would you delete duplicate records from a table?


Ans. To delete duplicate records from a table, you can use the DELETE statement
with a self-join or subquery.
Identify the duplicate records using a self-join or subquery

Use the DELETE statement to remove the duplicate records

Consider using a temporary table to store the unique records before deleting the
duplicates
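As a hedged illustration (a DELETE with a subquery needs an engine that supports it,
e.g. a Delta table in Spark SQL), the sketch below instead keeps only the unique rows,
which mirrors the temporary-table approach:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical customers data with duplicate emails
customers = spark.createDataFrame(
    [(1, "a@x.com"), (2, "a@x.com"), (3, "b@x.com")],
    ["id", "email"],
)

# Rank rows within each email and keep only the first one
w = Window.partitionBy("email").orderBy("id")
deduped = (customers
           .withColumn("rn", F.row_number().over(w))
           .filter("rn = 1")
           .drop("rn"))
deduped.show()

# Equivalent idea in one call when any representative row will do
customers.dropDuplicates(["email"]).show()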

Q6. How do you create a duplicate table? What are window functions? What are the
types of joins? Explain each join.
Ans. To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT. Window
functions are used for calculations across a set of table rows. Types of joins
include INNER, LEFT, RIGHT, and FULL OUTER joins.
To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT

Window functions are used for calculations across a set of table rows

Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins

Explain each join: INNER - returns rows only when there is a match in both tables;
LEFT - returns all rows from the left table plus the matched rows from the right
table;
RIGHT - returns all rows from the right table plus the matched rows from the left
table;
FULL OUTER - returns all rows from both tables, with NULLs where there is no match
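A minimal PySpark sketch covering all three ideas, with hypothetical emp/dept data:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

emp = spark.createDataFrame(
    [(1, "Asha", 10, 60000), (2, "Ravi", 20, 45000), (3, "Mia", 10, 70000)],
    ["emp_id", "name", "dept_id", "salary"],
)
dept = spark.createDataFrame([(10, "IT"), (30, "Finance")], ["dept_id", "dept_name"])
emp.createOrReplaceTempView("emp")

# Duplicate a table with CREATE TABLE AS (CTAS)
spark.sql("CREATE TABLE emp_copy USING parquet AS SELECT * FROM emp")

# Window function: rank employees by salary within each department
w = Window.partitionBy("dept_id").orderBy(F.desc("salary"))
emp.withColumn("salary_rank", F.rank().over(w)).show()

# Joins: inner keeps matches only, left/right keep one side, full outer keeps both
emp.join(dept, "dept_id", "inner").show()
emp.join(dept, "dept_id", "left").show()
emp.join(dept, "dept_id", "right").show()
emp.join(dept, "dept_id", "full_outer").show()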

Q7. How do you do performance optimization in Spark?


Ans. Performance optimization in Spark involves tuning configurations, optimizing
code, and utilizing caching.
Tune Spark configurations such as executor memory, cores, and parallelism

Optimize code by reducing unnecessary shuffles, using efficient transformations,


and avoiding unnecessary data movements

Utilize caching to store intermediate results in memory for faster access
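A short sketch of each lever; the configuration values are illustrative, not
recommendations:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Tune configurations at session build time (illustrative values)
spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

large = spark.range(10_000_000).withColumn("key", F.col("id") % 100)
small = spark.range(100).withColumnRenamed("id", "key")

# Avoid an unnecessary shuffle by broadcasting the small side of the join
joined = large.join(F.broadcast(small), "key")

# Cache an intermediate result that is reused several times
joined.cache()
joined.count()
joined.groupBy("key").count().show()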

Q8. How do you filter and transfer data from dashboard A to dashboard B?


Ans. Use data connectors or APIs to extract and transfer data from one dashboard to
another.
Utilize the data connectors or APIs provided by the dashboard platform to extract
data from dashboard A.

Transform the data as needed to match the format expected by dashboard B.

Use the data connectors or APIs of dashboard B to load the filtered data.
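A purely illustrative sketch with hypothetical REST endpoints, tokens, and field
names (real dashboard platforms each expose their own connectors and APIs):

import requests

# Hypothetical endpoints and auth token, for illustration only
SOURCE_URL = "https://dashboard-a.example.com/api/data"
TARGET_URL = "https://dashboard-b.example.com/api/data"
HEADERS = {"Authorization": "Bearer <token>"}

# 1. Extract from dashboard A
rows = requests.get(SOURCE_URL, headers=HEADERS, timeout=30).json()

# 2. Filter/transform to match dashboard B's expected format
filtered = [
    {"metric": r["name"], "value": r["value"]}
    for r in rows
    if r.get("region") == "EMEA"
]

# 3. Load into dashboard B
requests.post(TARGET_URL, json=filtered, headers=HEADERS, timeout=30)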

Q9. Do you have hands-on experience with big data tools?


Ans. Yes, I have hands-on experience with big data tools.
I have worked extensively with Hadoop, Spark, and Kafka.

I have experience with data ingestion, processing, and storage using these tools.

I have also worked with NoSQL databases like Cassandra and MongoDB.

I am familiar with data warehousing concepts and have worked with tools like
Redshift and Snowflake.

Q10. Describe the SSO process between Snowflake and Azure Active Directory.
Ans. The SSO process between Snowflake and Azure Active Directory involves
configuring SAML-based authentication.
Configure Snowflake to use SAML authentication with Azure AD as the identity
provider

Set up a trust relationship between Snowflake and Azure AD


Users authenticate through Azure AD and are granted access to Snowflake resources

SSO eliminates the need for separate logins and passwords for Snowflake and Azure
AD
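As a hedged sketch of the Snowflake side only, a SAML2 security integration can point
at Azure AD as the identity provider; every value below is a placeholder that would
come from the Azure AD enterprise application, and the statement is shown here
executed through the snowflake-connector-python package:

import snowflake.connector

# Connection details are placeholders
conn = snowflake.connector.connect(
    account="<account>", user="<admin_user>", password="<password>", role="ACCOUNTADMIN"
)

# SAML2 security integration pointing at Azure AD (all values are placeholders)
conn.cursor().execute("""
    CREATE SECURITY INTEGRATION azure_ad_sso
      TYPE = SAML2
      ENABLED = TRUE
      SAML2_ISSUER = 'https://sts.windows.net/<tenant-id>/'
      SAML2_SSO_URL = 'https://login.microsoftonline.com/<tenant-id>/saml2'
      SAML2_PROVIDER = 'CUSTOM'
      SAML2_X509_CERT = '<base64-encoded-certificate>'
      SAML2_SP_INITIATED_LOGIN_PAGE_LABEL = 'Azure AD SSO'
      SAML2_ENABLE_SP_INITIATED = TRUE
""")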
