Deloitte Data Engineer Interview Experience (0-3 YoE)
Query to fetch the top 3 salaries:
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 3;
The same result using a window function:
SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employee
) ranked_salaries
WHERE salary_rank <= 3;
• Clustered Index: Determines the physical order of rows in the table. A table can have only one clustered index.
• Non-Clustered Index: Creates a separate structure to store index data. A table can have multiple non-clustered indexes.
4. How would you optimize a query that takes too long to execute?
• Use indexes on frequently queried columns.
• Avoid SELECT *; retrieve only the columns you need, e.g. SELECT column1, column2 FROM table_name;
• DELETE: Removes rows one at a time, can be filtered with a WHERE clause, and can be rolled back.
• TRUNCATE: Removes all rows, faster than DELETE, cannot be rolled back.
WITH SalesData AS (
    SELECT customer_id, SUM(amount) AS total_sales  -- "amount" is a placeholder for the sales value column
    FROM sales
    GROUP BY customer_id
)
SELECT customer_id, total_sales
FROM SalesData;
9. Write a query to calculate the running total of sales for each month.
SELECT month, sales,
       SUM(sales) OVER (ORDER BY month) AS running_total
FROM sales_table;
10. Explain the difference between INNER JOIN, LEFT JOIN, and FULL
OUTER JOIN.
• INNER JOIN: Returns only matching records.
• LEFT JOIN: Returns all records from the left table and matching records from the
right.
• FULL OUTER JOIN: Returns all records from both tables, filling non-matching records with NULLs. Example (illustrative tables):
SELECT c.customer_id, c.name, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
11. How would you use Python for data cleaning and transformation?
Python is widely used for data cleaning and transformation in data analysis and
machine learning workflows. You can use libraries like pandas, NumPy, and re to perform
various data preparation tasks efficiently.
import pandas as pd
df = pd.read_csv("data.csv")
1. Handling Missing Values
# Check for missing values
print(df.isnull().sum())
# Fill missing values in a column with its mean
df["column_name"] = df["column_name"].fillna(df["column_name"].mean())
# Drop any rows that still contain missing values
df.dropna(inplace=True)
2. Removing Duplicates
df.drop_duplicates(inplace=True)
df["date_column"] = pd.to_datetime(df["date_column"])
5. Handling Outliers
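One common approach, shown here as an illustrative sketch (the column name is a placeholder), is to cap values outside the interquartile range:
# Cap outliers using the 1.5 * IQR rule
q1 = df["column_name"].quantile(0.25)
q3 = df["column_name"].quantile(0.75)
iqr = q3 - q1
df["column_name"] = df["column_name"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)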
• Rename Columns:
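For example (old_name and new_name are placeholder column names):
df.rename(columns={"old_name": "new_name"}, inplace=True)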
7. Filtering Data
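A typical pattern uses boolean indexing (column name and threshold are illustrative):
# Keep only rows where the value exceeds a threshold
filtered_df = df[df["column_name"] > 100]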
8. Encoding Categorical Data
Convert categorical data to numerical values using encoding techniques like Label Encoding or One-Hot Encoding.
# Label Encoding
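# (illustrative sketch assuming scikit-learn; category_column matches the one-hot example below)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["category_column"] = le.fit_transform(df["category_column"])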
# One-Hot Encoding
df = pd.get_dummies(df, columns=['category_column'])
9. Normalization/Standardization
You may need to scale numerical data for machine learning models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['scaled_column'] = scaler.fit_transform(df[['column_name']])
# Extract date-based features
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day_of_week'] = df['date_column'].dt.dayofweek
# Save the cleaned dataset
df.to_csv("cleaned_data.csv", index=False)
Summary
• Pandas is your main tool for data manipulation, cleaning, and transformation.
• Use encoding and scaling techniques for categorical and numerical features.
Python Script:
import mysql.connector

# Database connection details (placeholders)
db_config = {
    "host": "localhost",
    "user": "your_username",
    "password": "your_password",
    "database": "your_database"
}

conn = None
try:
    # Establishing connection
    conn = mysql.connector.connect(**db_config)
    if conn.is_connected():
        cursor = conn.cursor()
        # Query to fetch data (table name is a placeholder)
        query = "SELECT * FROM your_table;"
        cursor.execute(query)
        results = cursor.fetchall()
        for row in results:
            print(row)
except mysql.connector.Error as err:
    print(f"Error: {err}")
finally:
    if conn is not None and conn.is_connected():
        cursor.close()
        conn.close()
        print("Connection closed.")
• Best for: Pandas works well for small to medium-sized datasets on a single machine, while PySpark is built for big data and distributed computing.
• When to use: reach for Pandas for in-memory analysis, and for PySpark when the data no longer fits on one machine or must be processed across a cluster.
You can surround each ETL step (Extract, Transform, Load) with try-except blocks to catch
specific errors and take appropriate actions like logging the error or retrying the process.
import logging
import time

# Setup logging
logging.basicConfig(level=logging.INFO)

def extract():
    try:
        logging.info("Extracting data...")
        data = [{"id": 1, "value": 100}]  # placeholder standing in for the real source query/API call
        if not data:
            raise ValueError("No data extracted")
        return data
    except Exception as e:
        logging.error(f"Extraction failed: {e}")
        raise

def transform(data):
    try:
        logging.info("Transforming data...")
        transformed_data = [{**row, "value": row["value"] * 2} for row in data]  # placeholder transformation
        return transformed_data
    except Exception as e:
        logging.error(f"Transformation failed: {e}")
        raise

def load(data):
    try:
        logging.info("Loading data...")
        if not data:
            raise ValueError("No data to load")
        logging.info(f"Loaded {len(data)} records")  # placeholder for the real write to the warehouse
    except Exception as e:
        logging.error(f"Load failed: {e}")
        raise

def run_etl():
    try:
        data = extract()
        transformed_data = transform(data)
        load(transformed_data)
        logging.info("ETL pipeline completed successfully.")
    except Exception as e:
        logging.error(f"ETL pipeline failed: {e}")
        raise

if __name__ == "__main__":
    while True:
        try:
            run_etl()
            break  # stop retrying once the pipeline succeeds
        except Exception as e:
            logging.info("Retrying in 10 seconds...")
            time.sleep(10)  # Retry after 10 seconds if it fails
1. Logging: Use logging to capture detailed logs for debugging. It provides different
levels like INFO, ERROR, WARNING, etc.
3. Retries: If a step fails (e.g., network issue), retry it with a delay (time.sleep).
4. Raising Exceptions: After catching and logging an exception, raise it again if you
want the pipeline to fail and stop, or handle it at a higher level.
5. Data Validation: Before moving to the next ETL step, check that the data is valid
(e.g., non-empty).
• Integrate with a message queue (e.g., RabbitMQ) for reprocessing failed steps.
15. What libraries have you used for data processing in Python (e.g., Pandas, NumPy)?
1. Pandas – Best for structured data (CSV, Excel, SQL tables)
2. NumPy – Best for numerical computations on arrays and matrices
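A small illustrative sketch of how the two are typically combined (the column names and values here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [120, 80, 200]})
# Vectorized conditional logic: NumPy operating on a Pandas column
df["is_large_order"] = np.where(df["amount"] > 100, 1, 0)
print(df)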
Snowflake follows a multi-cluster shared data architecture with three key layers:
• Storage Layer: Stores data in compressed, columnar format on cloud object storage.
• Compute Layer: Independent virtual warehouses execute queries and scale separately from storage.
• Cloud Services Layer: Handles authentication, metadata, query parsing, and optimization.
By contrast, Google BigQuery uses Dremel (Google's query execution engine) for distributed SQL processing.
✔ Handling Missing Values → Use imputation (fillna() in Pandas) or flag records for review.
✔ Data Type Consistency → Convert data into the correct formats (e.g., int, float,
datetime).
✔ Outlier Detection → Identify and handle anomalies using statistical methods (e.g., Z-score).
✔ Business Rules Enforcement → Validate data against predefined rules (e.g., age cannot
be negative).
✔ Normalization & Standardization → Convert data into a consistent format (e.g.,
lowercase emails).
✔ ETL Logging & Alerts → Capture errors in logs and send notifications for failures.
✔ Data Profiling Tools → Use Great Expectations, dbt, or Apache Griffin to track data
quality.
✔ Unit Testing → Implement test cases using pytest or unittest for data validation.
By implementing these best practices, you can minimize errors and ensure high-quality data for analytics and decision-making. A minimal example of such a validation test is sketched below.
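As a rough sketch of the unit-testing point above, a data-validation check with pytest might look like this (the column names and rules are illustrative, not from the original answer):
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Business-rule checks: key column must be populated, age cannot be negative
    assert df["customer_id"].notnull().all(), "customer_id contains nulls"
    assert (df["age"] >= 0).all(), "age cannot be negative"
    return df

def test_validate_accepts_clean_data():
    df = pd.DataFrame({"customer_id": [1, 2], "age": [30, 45]})
    validate(df)  # should not raise
Running pytest on this file fails fast whenever incoming data violates the rules.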
Kafka is primarily used for building real-time data pipelines and streaming applications. It
allows you to process data in motion, which is crucial for handling high-throughput, low-latency data feeds.
2. Event-Driven Architecture
• Data can be sent from Kafka to data warehouses (e.g., Snowflake, BigQuery) for
batch processing.
• Integrates seamlessly with ETL tools like Apache Flink, Apache Spark, and Kafka
Streams for processing.
Kafka provides built-in fault tolerance and scalability, ensuring high availability and
reliability for data flows.
Kafka handles high-throughput and low-latency data streams, making it ideal for
applications where speed is critical (e.g., financial transactions, recommendation
engines).
• It can handle millions of messages per second with low latency, ensuring fast
processing.
Kafka offers durability by persisting data to disk, enabling long-term storage of messages.
Unlike traditional message queues, Kafka can retain messages for configurable retention
periods, allowing consumers to reprocess them as needed.
• Kafka’s log-based storage allows for scalable retention policies, useful for audit
logs, reprocessing data, or data archiving.
• Log aggregation: Collect logs from various systems for centralized analysis.
In summary, Kafka is a powerful tool in data engineering for managing high-volume, real-time, and fault-tolerant data streams across distributed systems. It plays a critical role in
building modern data architectures, particularly in streaming analytics, event sourcing,
and data integration pipelines.
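As an illustrative sketch only (it assumes the kafka-python client, a broker at localhost:9092, and a hypothetical "events" topic), a minimal producer and consumer look like this:
import json
from kafka import KafkaProducer, KafkaConsumer

# Produce a JSON event to the "events" topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()

# Consume events from the beginning of the topic
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    print(message.value)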
20. What is ETL? Explain its phases and tools you have worked with.
ETL (Extract, Transform, Load)
ETL is a data integration process used to move data from various sources into a centralized
data warehouse or data lake. It consists of three main phases: Extract, Transform, and
Load.
1. Extract Phase
The Extract phase involves retrieving raw data from various source systems, which could
include databases, APIs, flat files, or third-party services.
Key Steps:
• Connect to Source Systems: Data is pulled from multiple sources like relational
databases, web services, cloud platforms, etc.
• Data Extraction: The raw data is captured, usually in a format like CSV, JSON, or
XML.
• Python libraries (e.g., pandas, requests, pyodbc) for pulling data from APIs and databases.
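For instance, a minimal extract step might pull JSON from a REST API into a DataFrame (the URL and fields here are placeholders, not from the original answer):
import pandas as pd
import requests

# Pull raw records from a hypothetical REST endpoint
response = requests.get("https://api.example.com/orders", timeout=30)
response.raise_for_status()
raw_df = pd.DataFrame(response.json())
print(raw_df.head())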
2. Transform Phase
The Transform phase is where the raw data is cleaned, enriched, and converted into a
format suitable for analysis. This is the most complex phase, as it involves applying
business rules, data validation, and restructuring.
Key Steps:
• Data Cleaning: Handle missing values, duplicates, and inconsistent formats.
• Data Validation and Business Rules: Check records against the rules the business defines.
• Restructuring: Aggregate, join, or reshape the data into the target schema.
3. Load Phase
The Load phase involves writing the transformed data into the destination, typically a data
warehouse or data lake for further analysis.
Key Steps:
• Write the transformed data into the target tables in the warehouse or data lake, either as a full load or incrementally.
Tools:
• AWS Redshift or Google BigQuery for loading data into cloud data warehouses.
Example ETL Workflow (a minimal sketch follows the steps):
1. Extract raw data from the source systems, such as flat files or an API.
2. Transform the data by cleaning out null values, standardizing date formats, and enriching with geographic information.
3. Load the transformed data into a PostgreSQL database or AWS Redshift for analysis.
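The following is a rough, illustrative implementation of these steps, assuming a local CSV file as the source, a placeholder PostgreSQL connection string, and made-up column names:
import pandas as pd
from sqlalchemy import create_engine

# 1. Extract: read raw data from a CSV file (placeholder path)
df = pd.read_csv("raw_sales.csv")

# 2. Transform: drop null values and standardize the date format
df = df.dropna()
df["order_date"] = pd.to_datetime(df["order_date"])

# 3. Load: write the cleaned data into PostgreSQL (placeholder credentials)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")
df.to_sql("sales_clean", engine, if_exists="replace", index=False)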
• Power BI for transforming data with its built-in tools before loading it into dashboards.
• In modern data engineering, ETL processes have become more automated, with
tools like Apache Airflow for scheduling, dbt for transformation, and cloud-based
solutions like AWS Glue or Google Cloud Dataflow for scalable data processing.
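To illustrate the scheduling point, a bare-bones Airflow DAG (assuming Airflow 2.x; the DAG name, task, and schedule are made up) might wrap the ETL steps like this:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    # Placeholder for the extract/transform/load logic shown earlier
    print("ETL run complete")

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    etl_task = PythonOperator(task_id="run_etl", python_callable=run_etl)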
ETL is critical in building efficient data pipelines, ensuring data is clean, accurate, and
available for downstream analytics.