
DELOITTE DATA ENGINEER INTERVIEW EXPERIENCE (0-3 YoE)

1. Write a query to retrieve the top 3 highest salaries from an employee table.

SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 3;

Alternatively, if there are duplicate salaries and we need an accurate top 3:

SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employee
) ranked_salaries
WHERE rnk <= 3;

2. Explain the difference between a clustered and a non-clustered index.


• Clustered Index: Determines the physical order of data in a table. A table can have
only one clustered index.

• Non-Clustered Index: Creates a separate structure to store index data. A table can
have multiple non-clustered indexes.

3. What are window functions in SQL? Provide examples.


Window functions perform calculations across a set of table rows related to the current
row. Example:

SELECT employee_id, department, salary,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM employee;

4. How would you optimize a query that takes too long to execute?
• Use Indexes on frequently queried columns.

• Avoid SELECT *; retrieve only required columns.

• Optimize JOINs by indexing keys.

• Use EXPLAIN PLAN to analyze query performance.

• Normalize the database structure.

• Avoid redundant subqueries and use CTEs.

5. Write a query to find duplicate records in a table.


SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;

6. How do you handle NULL values in SQL?


• Use COALESCE() to replace NULLs:

SELECT COALESCE(salary, 0) FROM employee;

• Use IS NULL / IS NOT NULL in conditions.

• Use IFNULL() (MySQL) or NVL() (Oracle).

7. Explain the difference between DELETE, TRUNCATE, and DROP.


• DELETE: Removes specific rows, can be rolled back, and logs each row deletion.

• TRUNCATE: Removes all rows, faster than DELETE, cannot be rolled back.

• DROP: Removes the entire table from the database.

8. What is a CTE (Common Table Expression), and how is it different from a subquery?

• CTE: Temporary result set used in complex queries, improving readability.
• Subquery: A nested query inside another query. Example CTE:

WITH SalesData AS (
    SELECT customer_id, SUM(amount) AS total_sales
    FROM sales
    GROUP BY customer_id
)
SELECT * FROM SalesData WHERE total_sales > 1000;

9. Write a query to calculate the running total of sales for each month.

SELECT month, sales,
       SUM(sales) OVER (ORDER BY month) AS running_total
FROM sales_table;

10. Explain the difference between INNER JOIN, LEFT JOIN, and FULL
OUTER JOIN.
• INNER JOIN: Returns only matching records.

• LEFT JOIN: Returns all records from the left table and matching records from the
right.

• FULL OUTER JOIN: Returns all records from both tables, filling non-matching records with NULLs.

11. How would you use Python for data cleaning and transformation?
Python is widely used for data cleaning and transformation in data analysis and
machine learning workflows. You can use libraries like pandas, NumPy, and re to perform
various data preparation tasks efficiently.

1. Handling Missing Data

import pandas as pd

df = pd.read_csv("data.csv")
# Check for missing values

print(df.isnull().sum())

# Fill missing values with mean/median/mode

df["column_name"].fillna(df["column_name"].mean(), inplace=True)

# Drop rows/columns with missing values

df.dropna(inplace=True)

2. Removing Duplicates

df.drop_duplicates(inplace=True)

3. Data Type Conversion

df["date_column"] = pd.to_datetime(df["date_column"])

df["numeric_column"] = pd.to_numeric(df["numeric_column"], errors="coerce")

4. String Cleaning (Removing Special Characters, Lowercasing, etc.)

df["text_column"] = df["text_column"].str.lower().str.replace(r"[^a-zA-Z0-9]", " ", regex=True)

5. Handling Outliers

You can remove outliers or cap them based on a threshold.

# Remove rows with outliers

df = df[df['column_name'] < threshold_value]

# Alternatively, cap outliers

df['column_name'] = df['column_name'].clip(lower=min_value, upper=max_value)


6. Data Transformation

Transform data, such as changing column names or creating new features:

• Rename Columns:

df.rename(columns={'old_name': 'new_name'}, inplace=True)

• Create New Columns:

df['new_column'] = df['column1'] + df['column2']

7. Filtering Data

You can filter data based on certain conditions:

df_filtered = df[df['column_name'] > 50] # Rows where column_name > 50

8. Handling Categorical Data

Convert categorical data to numerical values using encoding techniques like Label
Encoding or One-Hot Encoding.

# Label Encoding

df['encoded_column'] = df['category_column'].map({'Category1': 0, 'Category2': 1})

# One-Hot Encoding

df = pd.get_dummies(df, columns=['category_column'])

9. Normalization/Standardization

You may need to scale numerical data for machine learning models.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df['scaled_column'] = scaler.fit_transform(df[['column_name']])

10. Date/Time Transformation

Extract specific components from datetime columns or create new time-based features.

df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month

df['day_of_week'] = df['date_column'].dt.dayofweek

11. Save Cleaned Data

After all transformations, save the cleaned dataset.

df.to_csv("cleaned_data.csv", index=False)

Summary

• Pandas is your main tool for data manipulation, cleaning, and transformation.

• Handle missing data, outliers, duplicates, and incorrect data types.

• Use encoding and scaling techniques for categorical and numerical features.

• Transform dates, filter data, and create new features as needed.

12. Write a Python script to connect to a database and fetch data using SQL queries.
Here's a Python script to connect to a MySQL database and fetch data using SQL
queries. This script uses the mysql-connector-python library to establish the connection.

Steps in the Script:

1. Install the required package (if not already installed):

pip install mysql-connector-python

2. Establish a connection with the database.
3. Execute an SQL query and fetch results.
4. Handle exceptions and close the connection properly.

Python Script:

import mysql.connector

# Database connection details
db_config = {
    "host": "your_host",          # e.g., "localhost" or an IP address
    "user": "your_username",      # e.g., "root"
    "password": "your_password",
    "database": "your_database"
}

try:
    # Establishing connection
    conn = mysql.connector.connect(**db_config)

    if conn.is_connected():
        print("Connected to the database!")

        # Create a cursor object
        cursor = conn.cursor()

        # SQL query to fetch data
        query = "SELECT * FROM your_table LIMIT 10;"
        cursor.execute(query)

        # Fetch and print results
        results = cursor.fetchall()
        for row in results:
            print(row)

except mysql.connector.Error as err:
    print(f"Error: {err}")

finally:
    # Close the connection (only if it was opened successfully)
    if "conn" in locals() and conn.is_connected():
        cursor.close()
        conn.close()
        print("Connection closed.")

Modifications for Different Databases:

• PostgreSQL: Use psycopg2

• SQL Server: Use pyodbc

• SQLite: Use sqlite3
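For a lighter-weight illustration, here is a minimal sketch of the same fetch pattern using Python's built-in sqlite3 module; the database file and table name are placeholders.

import sqlite3

# Connect to a local SQLite file (created automatically if it does not exist)
conn = sqlite3.connect("example.db")
try:
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM your_table LIMIT 10;")
    for row in cursor.fetchall():
        print(row)
except sqlite3.Error as err:
    print(f"Error: {err}")
finally:
    conn.close()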

13. Explain the difference between Pandas and PySpark for data manipulation.
Both Pandas and PySpark are popular Python libraries for data manipulation, but they are
suited for different use cases. Here’s a comparison:

Feature            | Pandas                                        | PySpark
Best For           | Small to medium-sized datasets                | Big data & distributed computing
Speed              | Fast for small datasets                       | Faster for large datasets (distributed processing)
Data Size          | Handles up to a few million rows efficiently  | Handles terabytes of data across multiple machines
Parallelism        | Single-threaded (limited by RAM)              | Multi-threaded (distributed via Spark clusters)
Memory Usage       | Stores all data in RAM                        | Uses disk storage & distributed memory
Ease of Use        | Simple, intuitive API                         | More complex but scalable
Installation       | Requires only pandas                          | Requires pyspark and Spark setup
Processing Engine  | Works in-memory on a single machine           | Uses Spark's distributed computing engine

When to Use?

Pandas → Best for small to medium datasets (Excel, CSV, databases).


PySpark → Best for large-scale data (big data, cloud-based, distributed processing).
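As a rough illustration (not part of the original comparison), here is a minimal sketch of the same aggregation in both libraries; the file name and column names are placeholders, and the PySpark part assumes a working Spark installation.

import pandas as pd
from pyspark.sql import SparkSession

# Pandas: everything happens in memory on a single machine
pdf = pd.read_csv("sales.csv")
print(pdf.groupby("region")["amount"].sum())

# PySpark: the same aggregation, executed by Spark's distributed engine
spark = SparkSession.builder.appName("pandas_vs_pyspark").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
sdf.groupBy("region").sum("amount").show()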

14. How would you handle exceptions in a Python-based ETL pipeline?


In an ETL (Extract, Transform, Load) pipeline, handling exceptions is crucial for ensuring
that the process runs smoothly and errors are managed properly. Here's how you can
handle exceptions in a Python-based ETL pipeline.

1. Using Try-Except Blocks

You can surround each ETL step (Extract, Transform, Load) with try-except blocks to catch
specific errors and take appropriate actions like logging the error or retrying the process.

2. General Structure for ETL Pipeline:

import logging
import time

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def extract():
    try:
        # Simulating data extraction (e.g., from a database or file)
        logging.info("Extracting data...")
        data = ["data1", "data2", "data3"]
        if not data:
            raise ValueError("No data extracted!")
        return data
    except Exception as e:
        logging.error(f"Error in extraction: {e}")
        raise  # Re-raise the error after logging

def transform(data):
    try:
        # Simulating data transformation (e.g., cleaning, filtering)
        logging.info("Transforming data...")
        transformed_data = [item.upper() for item in data]  # Example transformation
        return transformed_data
    except Exception as e:
        logging.error(f"Error in transformation: {e}")
        raise  # Re-raise the error after logging

def load(data):
    try:
        # Simulating loading data (e.g., inserting into a database)
        logging.info("Loading data...")
        if not data:
            raise ValueError("No data to load!")
        # Assume data is successfully loaded
        logging.info(f"Data loaded: {data}")
    except Exception as e:
        logging.error(f"Error in loading: {e}")
        raise  # Re-raise the error after logging

def run_etl():
    try:
        data = extract()
        transformed_data = transform(data)
        load(transformed_data)
    except Exception as e:
        logging.error(f"ETL process failed: {e}")
        raise  # Re-raise so the retry loop below can react

if __name__ == "__main__":
    while True:
        try:
            run_etl()
            logging.info("ETL pipeline completed successfully.")
            break  # Exit after successful completion
        except Exception as e:
            logging.error(f"ETL pipeline failed: {e}")
            logging.info("Retrying in 10 seconds...")
            time.sleep(10)  # Retry after 10 seconds if it fails

Key Components of Exception Handling:

1. Logging: Use logging to capture detailed logs for debugging. It provides different
levels like INFO, ERROR, WARNING, etc.

2. Specific Exceptions: Catch specific exceptions (e.g., ValueError, ConnectionError) to handle different scenarios separately.

3. Retries: If a step fails (e.g., network issue), retry it with a delay (time.sleep).

4. Raising Exceptions: After catching and logging an exception, raise it again if you want the pipeline to fail and stop, or handle it at a higher level.

5. Data Validation: Before moving to the next ETL step, check that the data is valid
(e.g., non-empty).

3. Advanced Exception Handling

For more advanced scenarios, you can:

• Use custom exception classes for specific errors.

• Integrate with a message queue (e.g., RabbitMQ) for reprocessing failed steps.

• Set up alerting mechanisms (e.g., sending an email or Slack notification) if the pipeline fails.
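As an illustrative sketch of the first point, custom exception classes might look like the following; ExtractionError, LoadError, and the simulated source call are hypothetical names, not part of any specific library.

import logging

# Hypothetical, pipeline-specific exception types (names are illustrative)
class ExtractionError(Exception):
    """Raised when the extract step fails."""

class LoadError(Exception):
    """Raised when the load step fails."""

def extract_from_source():
    try:
        raise ConnectionError("source database is unreachable")  # simulated failure
    except ConnectionError as e:
        # Wrap the low-level error in a pipeline-specific exception
        raise ExtractionError(f"Extract failed: {e}") from e

try:
    extract_from_source()
except ExtractionError as e:
    logging.error(f"Alerting on extract failure: {e}")
except LoadError as e:
    logging.error(f"Alerting on load failure: {e}")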

15. What libraries have you used for data processing in Python (e.g., Pandas, NumPy)?

1. Pandas – Best for structured data (CSV, Excel, SQL tables)
✔ Data manipulation: DataFrame and Series
✔ Handling missing values: .fillna(), .dropna()
✔ Aggregations: .groupby(), .pivot_table()
✔ Merging & joining: .merge(), .concat()

2. NumPy – Best for numerical computations & arrays
✔ Fast operations on large datasets
✔ Array handling: np.array(), np.reshape()
✔ Math functions: np.mean(), np.std(), np.linalg

3. PySpark – Best for big data processing
✔ Distributed data processing with Spark
✔ Handling large datasets that don’t fit in memory
✔ Functions: DataFrame.select(), groupBy(), filter()

4. Dask – Parallel computing for large Pandas-like datasets
✔ Works like Pandas but for larger-than-memory datasets
✔ Lazy execution for optimization
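For illustration, a minimal Dask sketch might look like the following; the file pattern and column names are placeholders.

import dask.dataframe as dd

# Lazily read a set of CSV files that together may not fit in memory
ddf = dd.read_csv("events-*.csv")

# Pandas-like API; nothing runs until .compute() is called
daily_counts = ddf.groupby("event_date")["event_id"].count().compute()
print(daily_counts.head())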

16. Describe the architecture of a cloud-based data warehouse like Snowflake or BigQuery.
A cloud-based data warehouse like Snowflake or BigQuery follows a distributed,
scalable, and serverless architecture designed for high-performance analytics. Here’s a
breakdown of their architectural components:

1. Snowflake Architecture (3-Tier)

Snowflake follows a multi-cluster shared data architecture with three key layers:

Compute Layer (Virtual Warehouses)

• Made up of virtual warehouses (clusters) that run queries.

• Each warehouse is independent, ensuring no resource contention.

• Supports automatic scaling (up/down based on workload).

Storage Layer

• Stores structured and semi-structured data (CSV, JSON, Parquet).

• Uses columnar storage for faster queries.

• Data is compressed, encrypted, and automatically managed.

• Decoupled from compute, allowing independent scaling.

Cloud Services Layer

• Manages query optimization, authentication, access control.

• Includes metadata management for tracking table statistics.


• Handles concurrent users and workload management.

Benefits of Snowflake:
✔ Auto-scaling & auto-suspend for cost savings
✔ Supports semi-structured data (JSON, Avro, etc.)
✔ Time travel feature for recovering past versions
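As a rough illustration of the compute/storage split, here is a minimal sketch using the snowflake-connector-python package; the account, credentials, warehouse, and table names are placeholders.

import snowflake.connector

# Account, credentials, and object names are placeholders
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="ANALYTICS_WH",   # compute (virtual warehouse) is chosen independently of storage
    database="SALES_DB",
    schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM orders")
print(cur.fetchone())
cur.close()
conn.close()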

2. Google BigQuery Architecture (Serverless)

BigQuery follows a serverless, columnar, and distributed architecture.

Storage Layer (Colossus)

• Uses columnar storage optimized for fast analytics.

• Supports automatic compression and partitioning.

• Data is stored in Google Cloud Storage (GCS).

Compute Layer (Dremel Execution Engine)

• Uses Dremel (Google’s query execution engine) for distributed SQL processing.

• Fully managed, auto-scaling compute.

• Queries are split into slots and executed in parallel.

Query Processing Layer

• Uses ANSI SQL with built-in machine learning (BigQuery ML).

• Supports federated queries (querying external sources like GCS, Bigtable).

• Offers BI Engine for in-memory analytics.

Benefits of BigQuery:
✔ Serverless → No infrastructure management
✔ Highly scalable and cost-effective (pay-per-query)
✔ Integration with Google AI & ML tools
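As a rough illustration of the serverless, pay-per-query model, here is a minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and Google Cloud credentials are assumed to be configured.

from google.cloud import bigquery

# Project, dataset, and table names are placeholders
client = bigquery.Client(project="your-project")
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM `your-project.sales_dataset.orders`
    GROUP BY region
"""
for row in client.query(query).result():   # billed per query, no cluster to manage
    print(row["region"], row["total_sales"])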

Snowflake vs. BigQuery: Key Differences

Feature      | Snowflake ❄                                             | BigQuery
Architecture | Compute & storage separated                              | Fully serverless
Scaling      | Manual & auto-scaling                                    | Auto-scaling
Storage      | Columnar, optimized for structured/semi-structured data  | Columnar, stored in Google Cloud
Pricing      | Pay for storage + compute usage                          | Pay per query
Use Case     | Best for high-performance, complex workloads             | Best for on-demand analytics & real-time queries


17. What is the difference between OLAP and OLTP databases?


OLAP vs. OLTP

Feature               | OLAP (Online Analytical Processing)                            | OLTP (Online Transaction Processing)
Purpose               | Used for analytical querying and reporting                     | Used for day-to-day transactional operations
Data Structure        | Typically stores large volumes of historical, summarized data  | Stores real-time transactional data
Query Complexity      | Complex queries, multi-dimensional analysis                    | Simple queries, mostly CRUD (Create, Read, Update, Delete) operations
Data Volume           | Large datasets, often aggregating data over time               | Small datasets, individual transaction records
Data Update Frequency | Infrequent updates (batch updates)                             | Frequent updates (real-time)
Examples              | Data warehousing, business intelligence                        | Banking systems, e-commerce platforms
Performance Focus     | Optimized for read-heavy operations (analysis, reports)        | Optimized for write-heavy operations (transactions)
Normalization         | Often denormalized for fast querying                           | Highly normalized to reduce data redundancy
Indexes               | Less frequent, optimized for read performance                  | Frequent indexing for fast retrieval of transaction data
Concurrency           | Lower concurrency, heavy read operations                       | High concurrency, many concurrent transactions

18. How do you ensure data quality during ETL processes?


Ensuring data quality during ETL (Extract, Transform, Load) processes is crucial for
maintaining accuracy, consistency, and reliability. Here’s how you can achieve it:

1. Extract Phase – Validate Incoming Data

✔ Source Validation → Ensure data is extracted from trusted sources.
✔ Schema Validation → Check column names, data types, and constraints.
✔ Data Completeness → Ensure all expected records are extracted.
✔ Deduplication → Remove duplicate records at the extraction stage.

2. Transform Phase – Cleaning & Standardization

✔ Handling Missing Values → Use imputation (fillna() in Pandas) or flag records for review.
✔ Data Type Consistency → Convert data into the correct formats (e.g., int, float, datetime).
✔ Outlier Detection → Identify and handle anomalies using statistical methods (e.g., Z-score).
✔ Business Rules Enforcement → Validate data against predefined rules (e.g., age cannot be negative).
✔ Normalization & Standardization → Convert data into a consistent format (e.g., lowercase emails).

3. Load Phase – Integrity & Audits

✔ Primary Key Checks → Ensure uniqueness constraints are met.
✔ Referential Integrity → Validate foreign key relationships before inserting data.
✔ Row Count Validation → Compare the number of records before and after loading.
✔ Data Reconciliation → Cross-check transformed data against source data.

4. Automated Quality Checks & Monitoring

✔ ETL Logging & Alerts → Capture errors in logs and send notifications for failures.
✔ Data Profiling Tools → Use Great Expectations, dbt, or Apache Griffin to track data quality.
✔ Unit Testing → Implement test cases using pytest or unittest for data validation, as sketched below.
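As an illustrative sketch of the unit-testing point, a few pytest checks on a transformed DataFrame might look like this; the column names and sample data are placeholders.

import pandas as pd
import pytest

@pytest.fixture
def transformed_data():
    # Stand-in for the output of the transform step
    return pd.DataFrame({"customer_id": [1, 2, 3], "age": [25, 40, 31]})

def test_no_missing_customer_ids(transformed_data):
    assert transformed_data["customer_id"].notnull().all()

def test_customer_ids_are_unique(transformed_data):
    assert transformed_data["customer_id"].is_unique

def test_age_is_non_negative(transformed_data):
    assert (transformed_data["age"] >= 0).all()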

By implementing these best practices, you can minimize errors and ensure high-
quality data for analytics and decision-making!

19. What is the role of Apache Kafka in data engineering?


Apache Kafka plays a key role in data engineering, particularly in real-time data
streaming, event-driven architectures, and data integration. Here’s an overview of its
role:

1. Real-Time Data Streaming

Kafka is primarily used for building real-time data pipelines and streaming applications. It
allows you to process data in motion, which is crucial for handling high-throughput, low-
latency data feeds.

• Producers send data to Kafka topics.

• Consumers read from those topics in real-time.


• Useful for scenarios like IoT data, real-time analytics, and log processing.
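As a rough illustration (assuming the kafka-python client; other clients such as confluent-kafka have similar APIs), a minimal producer/consumer sketch might look like this; the broker address and topic name are placeholders.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish an event to a topic (broker address and topic are placeholders)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "page": "/home"})
producer.flush()

# Consumer: read events from the same topic as they arrive (blocks and iterates indefinitely)
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)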

2. Event-Driven Architecture

Kafka enables event-driven architectures, where systems communicate through events. It decouples data producers and data consumers, allowing each to operate independently.

• Allows easy integration between various microservices.

• Ensures asynchronous communication and processing.

3. Data Integration and Data Pipelines

Kafka acts as a central messaging layer in complex data engineering pipelines. It facilitates the integration of multiple data sources, including databases, third-party systems, and internal applications.

• Data can be sent from Kafka to data warehouses (e.g., Snowflake, BigQuery) for
batch processing.

• Integrates seamlessly with ETL tools like Apache Flink, Apache Spark, and Kafka
Streams for processing.

4. Fault Tolerance and Scalability

Kafka provides built-in fault tolerance and scalability, ensuring high availability and
reliability for data flows.

• Data is replicated across multiple brokers for fault tolerance.

• Kafka can scale horizontally by adding more brokers to the cluster.

5. High Throughput and Low Latency

Kafka handles high-throughput and low-latency data streams, making it ideal for
applications where speed is critical (e.g., financial transactions, recommendation
engines).
• It can handle millions of messages per second with low latency, ensuring fast
processing.

6. Data Storage and Durability

Kafka offers durability by persisting data to disk, enabling long-term storage of messages.
Unlike traditional message queues, Kafka can retain messages for configurable retention
periods, allowing consumers to reprocess them as needed.

• Kafka’s log-based storage allows for scalable retention policies, useful for audit
logs, reprocessing data, or data archiving.

Common Use Cases of Apache Kafka:

• Log aggregation: Collect logs from various systems for centralized analysis.

• Metrics collection: Real-time metrics and monitoring of application performance.

• Real-time analytics: Real-time dashboards, fraud detection, and recommendation systems.

• Data synchronization: Synchronizing data across various systems (databases, applications).

In summary, Kafka is a powerful tool in data engineering for managing high-volume, real-
time, and fault-tolerant data streams across distributed systems. It plays a critical role in
building modern data architectures, particularly in streaming analytics, event sourcing,
and data integration pipelines.

20. What is ETL? Explain its phases and tools you have worked with.
ETL (Extract, Transform, Load)

ETL is a data integration process used to move data from various sources into a centralized
data warehouse or data lake. It consists of three main phases: Extract, Transform, and
Load.
1. Extract Phase

The Extract phase involves retrieving raw data from various source systems, which could
include databases, APIs, flat files, or third-party services.

Key Steps:

• Connect to Source Systems: Data is pulled from multiple sources like relational
databases, web services, cloud platforms, etc.

• Data Extraction: The raw data is captured, usually in a format like CSV, JSON, or
XML.

Tools for Extraction:

• Python libraries (e.g., pandas, requests, pyodbc) for pulling data from APIs,
databases.

• Apache Kafka for streaming real-time data.

• AWS Glue for serverless extraction from cloud storage.

• Talend for data extraction from different sources.

2. Transform Phase

The Transform phase is where the raw data is cleaned, enriched, and converted into a
format suitable for analysis. This is the most complex phase, as it involves applying
business rules, data validation, and restructuring.

Key Steps:

• Data Cleaning: Handle missing values, duplicates, and outliers.

• Data Enrichment: Add additional data or attributes from other sources.

• Data Standardization: Convert data to a standard format (e.g., date formats, currency conversions).

• Data Aggregation: Summarize data for analytical purposes.

• Data Validation: Ensure data integrity and consistency.

Tools for Transformation:

• Pandas for Python-based data manipulation.


• Apache Spark for large-scale data transformations.

• dbt for SQL-based transformations in data warehouses.

• Talend for visual data transformation workflows.

• Airflow for orchestrating transformation tasks.

3. Load Phase

The Load phase involves writing the transformed data into the destination, typically a data
warehouse or data lake for further analysis.

Key Steps:

• Bulk Load: Insert transformed data in large batches.

• Incremental Load: Only insert new or updated data to improve efficiency.

• Data Indexing: Create indices to speed up query performance.

Tools for Loading:

• AWS Redshift or Google BigQuery for loading data into cloud data warehouses.

• SQL Server or Oracle Database for traditional relational databases.

• Apache Hive for storing data in Hadoop-based data lakes.

• Pandas and SQLAlchemy for Python-based data loading.

• Apache Nifi for automated ETL pipelines.

ETL Process Example:

1. Extract data from an API containing customer transaction records.

2. Transform the data by cleaning out null values, standardizing date formats, and
enriching with geographic information.

3. Load the transformed data into a PostgreSQL database or AWS Redshift for
analysis.
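As a rough illustration of those three steps, a minimal sketch using requests, Pandas, and SQLAlchemy might look like this; the API URL, column names, and connection string are placeholders.

import pandas as pd
import requests
from sqlalchemy import create_engine

# 1. Extract: pull transaction records from an API (URL is a placeholder)
records = requests.get("https://api.example.com/transactions").json()
df = pd.DataFrame(records)

# 2. Transform: clean out nulls and standardize the date format
df = df.dropna(subset=["transaction_id", "amount"])
df["transaction_date"] = pd.to_datetime(df["transaction_date"])

# 3. Load: write the cleaned data into PostgreSQL (connection string is a placeholder)
engine = create_engine("postgresql://user:password@localhost:5432/analytics")
df.to_sql("transactions", engine, if_exists="append", index=False)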

ETL Tools I Have Worked With:


• Python (with Pandas, NumPy, and SQLAlchemy) for handling small to medium ETL
tasks and custom transformations.

• SQL for extracting and transforming data within relational databases.

• Apache Kafka for real-time data streaming and integration.

• Apache Airflow for orchestrating ETL pipelines and scheduling tasks.

• AWS Glue for serverless ETL jobs in the cloud.

• Power BI for transforming data within its in-built tools before loading it to
dashboards.

ETL in Modern Data Engineering:

• In modern data engineering, ETL processes have become more automated, with
tools like Apache Airflow for scheduling, dbt for transformation, and cloud-based
solutions like AWS Glue or Google Cloud Dataflow for scalable data processing.

ETL is critical in building efficient data pipelines, ensuring data is clean, accurate, and
available for downstream analytics.
