ETL (Extract, Transform, Load)
Based on what you've shared, here's a comprehensive breakdown of the topics you need to
prepare for the Endava AI Academy technical interview, focusing on ETL, data normalization,
machine learning models, anagram checking, and Sudoku solving, with Python as your main
language. I’ll split this into theoretical concepts and practical coding problems, with examples and
preparation tips.
1. ETL (Extract, Transform, Load)
Theory:
What is ETL?
Tools commonly used (e.g., Apache Airflow, Talend, Pandas for lightweight ETL)
Concepts like:
Data cleaning
Data transformation
Data warehousing
Python Practice:
Example:
python
import pandas as pd
# Extract
df = pd.read_csv('data.csv')
# Transform
df = df.dropna()
df['salary'] = df['salary'].apply(lambda x: float(x.replace('$', '')))
# Load
df.to_csv('cleaned_data.csv', index=False)
2. Data Normalization
Theory:
Why is it important in ML?
Types of normalization:
Min-Max Scaling
Z-score (StandardScaler)
Python Practice:
Use `sklearn.preprocessing`
python
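# A minimal sketch (illustrative data) using sklearn.preprocessing:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])
X_minmax = MinMaxScaler().fit_transform(X)      # Min-Max: rescales each feature to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # Z-score: zero mean, unit variance
print(X_minmax)
print(X_standard)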
3. Machine Learning Models
Theory:
Linear Regression
Logistic Regression
Decision Trees
KNN
KMeans
Overfitting/underfitting
Cross-validation
Python Practice:
python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
4. Anagram Checker
Example:
python
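# A minimal sketch of an anagram check (sort and compare, O(n log n)):
def is_anagram(s1, s2):
    return sorted(s1) == sorted(s2)

print(is_anagram("listen", "silent"))  # True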
5. Sudoku Solver
Backtracking algorithm
Constraints: each number 1-9 must appear once per row, column, and 3x3 grid
Python Code:
python
def is_valid(board, row, col, num):
    # Check the row, the column, and the 3x3 box for num
    for i in range(9):
        if board[row][i] == num or board[i][col] == num:
            return False
    row_start, col_start = 3 * (row // 3), 3 * (col // 3)
    for i in range(row_start, row_start + 3):
        for j in range(col_start, col_start + 3):
            if board[i][j] == num:
                return False
    return True

def solve_sudoku(board):
    for row in range(9):
        for col in range(9):
            if board[row][col] == 0:
                for num in range(1, 10):
                    if is_valid(board, row, col, num):
                        board[row][col] = num
                        if solve_sudoku(board):
                            return True
                        board[row][col] = 0
                return False
    return True
Brush up on Python basics: list/dict comprehensions, sets, lambda functions, error handling, and
modules like `collections`, `math`, `re`.
Would you like a personalized mock interview question set or a quiz to test yourself on these topics?
Excellent! Let's dive deep into everything you need to know for the Endava AI Academy technical
interview — including theory, key algorithms, code patterns, and use cases. This guide is structured
by topic and will cover all relevant methods and algorithms that could realistically be asked in an
interview of this nature.
Load: Store the final data into a database, data warehouse, or data lake.
Remove duplicates
Fill or drop missing values
Convert data types (e.g., string to float)
Normalize data
Feature engineering
Filtering rows/columns
ETL vs ELT (ELT loads data first, then transforms it inside the warehouse)
Data Lake vs Data Warehouse
2. Data Normalization
2.1. Why Normalize?
3. Machine Learning Models
Linear Regression
Logistic Regression
Classification (binary/multi-class)
Outputs probabilities
Good baseline model
Decision Tree
Easy to interpret
Random Forest
K-Means Clustering
Dimensionality reduction
python
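# A minimal dimensionality-reduction sketch with PCA (illustrative toy data):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)                      # 100 samples, 5 features
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                          # (100, 2)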
4. Anagram Checker
4.1. Method 1: Sort and Compare
python
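# Sort both strings and compare (O(n log n)):
def is_anagram(s1, s2):
    return sorted(s1) == sorted(s2)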
Given a 9x9 board with some cells filled, fill the empty cells with digits 1–9 such that each row, column,
and 3x3 sub-grid contains each digit once.
Steps:
1. Find an empty cell
2. Try digits 1–9 that do not violate the constraints
3. Recurse on the next empty cell
4. If stuck, backtrack
python
def solve_sudoku(board):
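    # assumes an is_valid(board, row, col, num) helper that checks the row, column, and 3x3 box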
for row in range(9):
for col in range(9):
if board[row][col] == 0:
for num in range(1, 10):
if is_valid(board, row, col, num):
board[row][col] = num
if solve_sudoku(board):
return True
board[row][col] = 0
return False
return True
List/dict/set comprehensions
`zip()`, `enumerate()`
Exception handling
Big-O notation
Sorting: O(n log n)
Would you like flashcards, a cheat sheet PDF, or a mock test to practice these topics?
Absolutely — let's break down each topic in depth with a solid theoretical foundation, key algorithms,
use cases, and Python examples. This will be your comprehensive study guide for the Endava AI
Academy technical interview.
✅ 1. ETL (Extract, Transform, Load)
🔷 1.1 Extract
Sources can include files, APIs, databases, and web scraping.
Python Tools:
python
import pandas as pd
df = pd.read_csv("file.csv")
OR
python
import requests
response = requests.get("https://api.example.com/data")
data = response.json()
🔷 1.2 Transform
Goal: Clean and prepare data for analysis or storage.
Transformations include:
Cleaning: Remove nulls, fix typos
Normalization/Standardization
Example:
python
df['age'] = df['age'].fillna(df['age'].mean())
df['salary'] = df['salary'].str.replace('$', '').astype(float)
🔷 1.3 Load
Goal: Store transformed data into a target system.
Targets: files (e.g., CSV), relational databases, data warehouses, or data lakes.
python
df.to_csv("cleaned_data.csv", index=False)
✅ 2. Data Normalization
🔷 Why?
Algorithms assume features are on similar scales.
Prevents dominance of high-scale features.
🔷 Types of Normalization:
2.1. Min-Max Scaling
2.2. Z-score Standardization (StandardScaler)
2.3. RobustScaler
2.4. MaxAbsScaler
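A minimal sketch of these scalers on a toy feature array (illustrative data; all classes come from `sklearn.preprocessing`):
python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])   # one feature with an outlier

print(MinMaxScaler().fit_transform(X))     # rescales to [0, 1]
print(StandardScaler().fit_transform(X))   # zero mean, unit variance
print(RobustScaler().fit_transform(X))     # uses median and IQR, robust to outliers
print(MaxAbsScaler().fit_transform(X))     # divides by the maximum absolute value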
✅ 3. Machine Learning Models
🔷 Logistic Regression
Classification.
Outputs probability using sigmoid.
🔷 Cross-Validation
Use `KFold` or `StratifiedKFold` to split data.
python
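# A minimal cross-validation sketch (illustrative iris data):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))
print(scores.mean())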
🔷 Python ML Template
python
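# A minimal end-to-end template (a sketch; the dataset and model choice are illustrative):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, pipe.predict(X_test)))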
✅ 4. Anagram Checker
🔷 Definition
Two strings are anagrams if they contain the same characters in any order.
🔷 Methods
Method 1: Sort and Compare
python
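# A minimal sketch: sort both strings and compare (O(n log n))
def are_anagrams(s1, s2):
    return sorted(s1) == sorted(s2)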
Method 2: Count Characters
python
def are_anagrams(s1, s2):
    # Count characters of the first string, then decrement while scanning the second
    if len(s1) != len(s2):
        return False
    count = {}
    for c in s1:
        count[c] = count.get(c, 0) + 1
    for c in s2:
        if count.get(c, 0) == 0:
            return False
        count[c] -= 1
    return True
🔷 Constraints
9x9 board
Each number 1–9 appears only once in:
Each row
Each column
Each 3x3 grid
🔷 Algorithm Steps
1. Loop through board
2. Find empty cell
3. Try digits 1–9 that do not violate the constraints
4. If no digit fits, backtrack and try a different digit in the previous cell
🔷 Python Implementation
python
def solve_sudoku(board):
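    # assumes an is_valid(board, i, j, num) helper that checks the row, column, and 3x3 box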
for i in range(9):
for j in range(9):
if board[i][j] == 0:
for num in range(1, 10):
if is_valid(board, i, j, num):
board[i][j] = num
if solve_sudoku(board):
return True
board[i][j] = 0
return False
return True
✅ 6. Python Concepts to Know
🔷 Data Structures
Lists, Tuples, Dictionaries, Sets
List comprehension
`zip()`, `enumerate()`
Slicing
🔷 OOP Basics
Classes, inheritance, methods
python
class Animal:
def __init__(self, name): self.name = name
def speak(self): return "I am " + self.name
🔷 Exception Handling
python
try:
val = int("abc")
except ValueError:
print("Conversion failed")
🔷 Useful Modules
`collections`: Counter, defaultdict
`math`: ceil, floor, sqrt
Searching
Recursion
✅ 8. Big-O Notation
| Algorithm | Time Complexity |
|---|---|
| Linear Search | O(n) |
| Binary Search | O(log n) |
Yes — absolutely! While Backtracking is crucial (especially for problems like Sudoku, N-Queens,
permutations, etc.), there are several other core algorithmic paradigms that are essential to know for
any technical interview. These often form the backbone of competitive programming and problem-
solving in interviews, especially for roles involving AI, data, or software engineering.
✅ 1. Greedy Algorithms
Use Cases:
Activity selection
Huffman encoding
Fractional knapsack
Dijkstra’s shortest path (with priority queue)
Use Cases:
python
def binary_search(arr, target):
    # Classic iterative binary search on a sorted list (function name assumed)
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
✅ 5. Graph Algorithms
Core Concepts:
Example: DFS
python
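# A minimal recursive DFS sketch (assumes an adjacency-list graph):
def dfs(graph, node, visited=None):
    if visited is None:
        visited = set()
    visited.add(node)
    for neighbor in graph[node]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited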
✅ 6. Recursion
Concept: A function calls itself to solve smaller instances of the same problem (see the sketch below).
Use Cases:
Tree traversals
Permutations, combinations
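A minimal illustration of the pattern (factorial, as a sketch):
python
def factorial(n):
    # Base case stops the recursion; the recursive case shrinks the problem
    if n <= 1:
        return 1
    return n * factorial(n - 1)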
✅ 7. Sorting Algorithms
You don't need to implement them from scratch in interviews often, but understanding time complexity
and behavior is critical.
✅ 8. Heaps (Priority Queues)
python
import heapq
heap = []
heapq.heappush(heap, 4)
heapq.heappush(heap, 2)
smallest = heapq.heappop(heap)
✅ 9. Bit Manipulation
Set/unset/toggle bit
Count bits: `bin(x).count("1")`
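A quick sketch of these operations (the values of `x` and the bit position `i` are illustrative):
python
x = 0b1010
i = 2
set_bit    = x | (1 << i)       # set bit i
clear_bit  = x & ~(1 << i)      # unset bit i
toggle_bit = x ^ (1 << i)       # toggle bit i
bit_count  = bin(x).count("1")  # count set bits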
Summary Table of Algorithm Categories:
| Type | Examples | Use Cases |
|---|---|---|
| Divide & Conquer | Merge Sort, Binary Search | Recursion with merge |
| Sliding Window | Substring problems | Optimal subarrays |
| Graph | BFS, DFS, Dijkstra | Networks, traversal |
Would you like example problems and solutions for any of these algorithm types?
Great! Let's walk through example problems and solutions for the most important algorithm
categories you'll need to know for your interview.
✅ 1. Backtracking
🔹 Problem: Solve the N-Queens Problem
Goal: Place N queens on an N×N chessboard so no two queens attack each other.
python
def solve_n_queens(n):
    res = []
    board = [-1] * n

    def is_valid(row, col):
        # No earlier queen shares this column or either diagonal
        for r in range(row):
            c = board[r]
            if c == col or abs(c - col) == row - r:
                return False
        return True

    def backtrack(row):
        if row == n:
            res.append(board[:])
            return
        for col in range(n):
            if is_valid(row, col):
                board[row] = col
                backtrack(row + 1)

    backtrack(0)
    return res
✅ 2. Greedy
🔹 Problem: Activity Selection
Goal: Choose the maximum number of non-overlapping intervals.
python
def activity_selection(activities):
activities.sort(key=lambda x: x[1]) # sort by end time
last_end = 0
count = 0
for start, end in activities:
if start >= last_end:
count += 1
last_end = end
return count
✅ 3. Dynamic Programming
🔹 Problem: 0/1 Knapsack
python
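# A minimal 0/1 knapsack sketch (weights, values, and capacity W are assumed inputs):
def knapsack(weights, values, W):
    n = len(weights)
    # dp[i][w] = best value using the first i items with capacity w
    dp = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(W + 1):
            dp[i][w] = dp[i - 1][w]
            if weights[i - 1] <= w:
                dp[i][w] = max(dp[i][w], dp[i - 1][w - weights[i - 1]] + values[i - 1])
    return dp[n][W]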
✅ 4. Divide and Conquer
🔹 Problem: Merge Sort
python
def merge_sort(arr):
if len(arr) <= 1:
return arr
mid = len(arr)//2
left = merge_sort(arr[:mid])
right = merge_sort(arr[mid:])
    return merge(left, right)

def merge(left, right):
    # Merge two sorted lists into a single sorted list (helper assumed by merge_sort)
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result
✅ 5. Sliding Window
🔹 Problem: Longest Substring Without Repeating Characters
python
def longest_unique_substring(s):
seen = {}
left = 0
max_len = 0
for right, ch in enumerate(s):
if ch in seen and seen[ch] >= left:
left = seen[ch] + 1
seen[ch] = right
max_len = max(max_len, right - left + 1)
return max_len
✅ 6. Graph Algorithms
🔹 Problem: BFS for Shortest Path in Unweighted Graph
python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    # Explore level by level, storing the path taken to reach each node (function name assumed)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None
✅ 7. Trie
🔹 Problem: Implement a basic Trie
python
class TrieNode:
def __init__(self):
self.children = {}
self.end = False
class Trie:
def __init__(self):
self.root = TrieNode()
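    # Minimal insert/search sketch for the Trie above (method bodies are assumptions):
    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.end = True

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.end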
✅ 8. Heaps (Priority Queues)
python
import heapq
✅ 9. Bit Manipulation
🔹 Problem: Find single number where every other number appears twice
python
def single_number(nums):
result = 0
for num in nums:
result ^= num
return result
Would you like a practice set with explanations for each algorithm type, or maybe a mock interview
format next?
Certainly! Here's a curated list of 3 interview-style problems per algorithm, tailored to AI, Data
Engineering, and Data Science contexts. Each includes a brief description and time complexity analysis.
🧠 1. Backtracking
1.1. Sudoku Solver (AI)
Problem: Fill a 9x9 Sudoku grid so that each row, column, and 3x3 subgrid contains all digits from
1 to 9.
Time Complexity: O(9⁸¹) in the worst case, but practical solutions are faster due to pruning.
Use Case: Constraint satisfaction problems in AI.
Problem: Given a 2D board and a word, check if the word exists in the grid by sequentially adjacent
letters.
Time Complexity: O(N * 3^L), where N is the number of cells and L is the word length.
Use Case: Pattern recognition in text data.
Problem: Determine if there's a subset of a given set with a sum equal to a target value.
⚡ 2. Greedy Algorithms
2.1. Huffman Coding (AI)
Problem: Build a prefix code (Huffman code) for a set of characters based on their frequencies.
Time Complexity: O(n log n), where n is the number of unique characters.
Use Case: Data compression in AI applications.
Problem: Select the maximum number of activities that don't overlap, given their start and end
times.
Problem: Schedule jobs to maximize profit, each with a deadline and profit, ensuring only one job
at a time.
Time Complexity: O(n log n), with n being the number of jobs.
Use Case: Optimizing batch processing tasks.
🔁 3. Dynamic Programming
3.1. Longest Common Subsequence (AI)
Problem: Find the length of the longest subsequence common to two sequences.
Time Complexity: O(m * n), where m and n are the lengths of the sequences.
Use Case: Comparing model outputs in NLP tasks.
Problem: Determine the maximum value that can be put in a knapsack of capacity W, with given
weights and values.
Time Complexity: O(n * W), where n is the number of items.
Use Case: Resource allocation in data storage.
Problem: Compute the minimum number of operations to convert one string to another.
Time Complexity: O(m * n), with m and n being the lengths of the strings.
Use Case: Data cleaning and preprocessing.
📏 5. Sliding Window
5.1. Maximum Sum Subarray of Size K (Data Engineering)
Problem: Find the maximum sum of any contiguous subarray of size K.
5.2. Longest Substring Without Repeating Characters
Problem: Find the length of the longest substring without repeating characters.
Time Complexity: O(n), using a sliding window.
Use Case: Analyzing text data.
5.3. Minimum Window Substring (AI)
Problem: Find the smallest substring of one string that contains all characters of another string.
Time Complexity: O(n), with n being the length of the main string.
Use Case: Information retrieval tasks.
🌐 6. Graph Algorithms
6.1. Dijkstra's Algorithm (Data Engineering)
Problem: Find the shortest path from a source to all vertices in a weighted graph.
Time Complexity: O((V + E) log V), using a priority queue.
Use Case: Optimizing data flow in networks.
🌲 7. Trie
7.1. Implement Trie (Prefix Tree) (Data Science)
Problem: Given a list of products and a search word, return a list of product suggestions as the
user types each character.
Time Complexity: O(m^2), with m being the length of the search word.
Use Case: Autocomplete features.
Problem: Replace words in a sentence with their root forms from a dictionary.
Time Complexity: O(n * k), where n is the number of words and k is the average word length.
Use Case: Data normalization.
⚙️ 9. Bit Manipulation
9.1. Single Number (Data Engineering)
Problem: Find the element that appears once when every other element appears twice.
Time Complexity: O(n).
9.2. Counting Bits
Problem: For every number i in the range 0 ≤ i ≤ n, calculate the number of 1's in their binary
representation.
Absolutely — here is the continuation for Bit Manipulation problems and their contexts:
9.2. Counting Bits
Problem: For every number `i` in the range `0 ≤ i ≤ n`, calculate the number of 1's in their binary
representation.
Time Complexity: O(n)
Use Case: Useful in feature engineering (e.g., for creating binary feature sets or understanding
Hamming weights in data compression).
Would you like a printable cheat sheet or a set of mock coding problems based on these categories
next?
Certainly! Below are Python implementations for each of the problems mentioned across different
algorithms. Let's dive into the code!
🧠 1. Backtracking
1.1. Sudoku Solver
python
def solve_sudoku(board):
def is_valid(board, row, col, num):
for i in range(9):
if board[row][i] == num or board[i][col] == num:
return False
row_start, col_start = 3 * (row // 3), 3 * (col // 3)
for i in range(row_start, row_start + 3):
for j in range(col_start, col_start + 3):
if board[i][j] == num:
return False
return True
def solve(board):
for i in range(9):
for j in range(9):
if board[i][j] == 0:
for num in range(1, 10):
if is_valid(board, i, j, num):
board[i][j] = num
if solve(board):
return True
board[i][j] = 0
return False
return True
solve(board)
return board
⚡ 2. Greedy Algorithms
2.1. Huffman Coding
python
import heapq
from collections import Counter
class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        # Allows heapq to order nodes by frequency
        return self.freq < other.freq

def generate_codes(node, prefix="", codebook=None):
    # Traverse the Huffman tree: a left edge appends "0", a right edge appends "1"
    if codebook is None:
        codebook = {}
    if node is None:
        return codebook
    if node.char is not None:
        codebook[node.char] = prefix or "0"
    generate_codes(node.left, prefix + "0", codebook)
    generate_codes(node.right, prefix + "1", codebook)
    return codebook

def huffman_coding(s):
    frequency = Counter(s)
    priority_queue = [Node(char, freq) for char, freq in frequency.items()]
    heapq.heapify(priority_queue)
    # Repeatedly merge the two least frequent nodes until one tree remains
    while len(priority_queue) > 1:
        left = heapq.heappop(priority_queue)
        right = heapq.heappop(priority_queue)
        merged = Node(None, left.freq + right.freq)
        merged.left = left
        merged.right = right
        heapq.heappush(priority_queue, merged)
    root = priority_queue[0]
    return generate_codes(root)
🔁 3. Dynamic Programming
3.1. Longest Common Subsequence
python
def longest_common_subsequence(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = length of the LCS of s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
python
def merge_sort(arr):
if len(arr) <= 1:
return arr
mid = len(arr) // 2
left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge(left, right)  # merge() interleaves two sorted lists, as in the Merge Sort example above
📏 5. Sliding Window
5.1. Maximum Sum Subarray of Size K
python
def max_sum_subarray_k(arr, k):
    # Sliding window: maintain the sum of the current window of size k (function name assumed)
    window_sum = sum(arr[:k])
    max_sum = window_sum
    for i in range(k, len(arr)):
        window_sum += arr[i] - arr[i - k]
        max_sum = max(max_sum, window_sum)
    return max_sum
🌐 6. Graph Algorithms
6.1. Dijkstra's Algorithm
python
import heapq
def dijkstra(graph, source):
    # graph: {node: [(neighbor, weight), ...]}; every node appears as a key
    dist = {node: float('inf') for node in graph}
    dist[source] = 0
    pq = [(0, source)]
    while pq:
        current_dist, node = heapq.heappop(pq)
        if current_dist > dist[node]:
            continue  # stale queue entry
        for neighbor, weight in graph[node]:
            new_dist = current_dist + weight
            if new_dist < dist[neighbor]:
                dist[neighbor] = new_dist
                heapq.heappush(pq, (new_dist, neighbor))
    return dist
🌲 7. Trie
7.1. Implement Trie
python
class TrieNode:
def __init__(self):
self.children = {}
self.end_of_word = False
class Trie:
def __init__(self):
self.root = TrieNode()
    def starts_with(self, prefix):
        # Prefix search: walk the children map character by character (method name assumed)
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True
8. Heaps: Merge K Sorted Lists
python
import heapq
def merge_k_sorted_lists(lists):
min_heap = []
for i, lst in enumerate(lists):
if lst:
heapq.heappush(min_heap, (lst[0], i, 0)) # (value, list index, element index)
result = []
    while min_heap:
        val, list_idx, ele_idx = heapq.heappop(min_heap)
        result.append(val)
        # Push the next element from the same source list, if any
        if ele_idx + 1 < len(lists[list_idx]):
            heapq.heappush(min_heap, (lists[list_idx][ele_idx + 1], list_idx, ele_idx + 1))
    return result
⚙️ 9. Bit Manipulation
9.1. Single Number
python
def single_number(nums):
result = 0
for num in nums:
result ^= num
return result
Conclusion
The above Python implementations cover the key problems based on each algorithm type (Backtracking,
Greedy, Dynamic Programming, etc.). Many of them rely on space-time tradeoffs and efficient building
blocks such as heaps, dynamic programming tables, and sorting to keep their time complexity low.
If you'd like to run through some example inputs or further explanations, let me know!
Great question! No, anagram checking is not explicitly covered in the problems above, but it is a
common coding interview problem and can be related to string manipulation algorithms, which are
crucial for data science and AI applications. At its core it is a string-comparison problem, so let's dive
into how it can be implemented in Python!
Anagram Checking
Two strings are anagrams if they can be rearranged to form each other, meaning they have the same
characters with the same frequencies.
There are multiple ways to check if two strings are anagrams:
1. Using Sorting
The idea is to sort both strings and check if they are equal. This approach has a time complexity of O(n
log n).
python
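# A minimal sketch of the sorting approach (O(n log n)):
def are_anagrams(s1, s2):
    return sorted(s1) == sorted(s2)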
# Example usage:
print(are_anagrams("listen", "silent")) # True
print(are_anagrams("hello", "world")) # False
2. Using Character Counting
Another approach is to count the frequency of each character in both strings and compare the
frequencies. This approach is more efficient and runs in O(n) time complexity.
python
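# A minimal sketch using collections.Counter (O(n)):
from collections import Counter

def are_anagrams(s1, s2):
    return Counter(s1) == Counter(s2)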
# Example usage:
print(are_anagrams("listen", "silent")) # True
print(are_anagrams("hello", "world")) # False
3. Using a Fixed-Size Count Array
If you know that the strings consist of only lowercase letters (e.g., `'a'` to `'z'`), you can use a fixed-
size array (of length 26) to store character counts. This approach is very efficient with O(n) time
complexity and O(1) space complexity (since the space is fixed regardless of string length).
python
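# A minimal sketch using a fixed-size count array (assumes lowercase 'a'-'z' only):
def are_anagrams(s1, s2):
    if len(s1) != len(s2):
        return False
    counts = [0] * 26
    for c1, c2 in zip(s1, s2):
        counts[ord(c1) - ord('a')] += 1
        counts[ord(c2) - ord('a')] -= 1
    return all(c == 0 for c in counts)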
# Example usage:
print(are_anagrams("listen", "silent")) # True
print(are_anagrams("hello", "world")) # False
Conclusion
The anagram checking problem is not explicitly listed in the algorithms I provided above, but it is an
important problem related to string manipulation and could certainly show up in an interview.
If you have any more questions or need additional variations of this problem, feel free to ask!
To properly prepare for a coding interview at the intersection of AI, Data Science, and Data
Engineering, it's essential to have a solid understanding of various coding problems and concepts.
Below is a comprehensive list of key coding interview problems categorized by AI, Data Science, and
Data Engineering that you should be familiar with. These problems are commonly asked in technical
interviews and can test your proficiency in algorithms, data structures, and problem-solving.
3. Backtracking
Problem: Solve problems like Sudoku Solver, N-Queens, Word Search.
Importance: Backtracking is useful for solving constraint satisfaction problems and
optimization problems, such as in puzzles and games.
4. Greedy Algorithms
7. Decision Trees
Problem: Implement a Decision Tree for classification.
Importance: Understanding decision trees is crucial for building interpretable machine
learning models and also for tree-based algorithms like Random Forest and Gradient
Boosting.
8. Neural Networks
Problem: Implement a basic neural network from scratch (using backpropagation and
gradient descent).
Importance: Vital for understanding the core of deep learning algorithms and training neural
networks in AI.
9. Bayes Theorem & Probability
Problem: Naive Bayes Classifier.
Importance: Understanding probability and Bayes' theorem is essential for many
probabilistic AI models.
10. Optimization Problems
Problem: Solve problems like Gradient Descent, Simulated Annealing, Genetic
Algorithms.
Importance: Optimization techniques are fundamental to training machine learning models,
particularly deep learning models.
📊 Data Science
1. Data Cleaning and Transformation
Problem: Clean and preprocess datasets (e.g., handling missing data, normalization,
categorical encoding).
Importance: Data cleaning is a critical first step in any data science project.
Importance: Statistics is key for analyzing data, interpreting model results, and
understanding distributions.
5. Regression Problems
Problem: Solve problems like Linear Regression, Ridge, Lasso, and Polynomial Regression.
Importance: Regression models are central to data science for predicting continuous
variables.
6. Classification Problems
Problem: Implement models for Logistic Regression, K-Nearest Neighbors, Support
Vector Machines, and Naive Bayes.
Importance: Classification algorithms are fundamental for labeling data into categories.
7. Time Series Analysis
Problem: Work with ARIMA, SARIMA, or Exponential Smoothing models.
Importance: Time series forecasting is crucial for predicting trends in stock prices, weather,
and sales.
8. Clustering
Problem: Implement clustering algorithms like K-Means, DBSCAN, and Hierarchical
Clustering.
Importance: Clustering helps identify natural groupings within data, useful for
segmentation, customer profiling, and anomaly detection.
9. Dimensionality Reduction
Problem: Apply techniques like Principal Component Analysis (PCA) and t-SNE.
Importance: Dimensionality reduction is key to simplifying data for visualization and
improving model performance.
10. Recommendation Systems
Problem: Build recommendation systems using techniques like Collaborative Filtering and
Content-Based Filtering.
Importance: Recommendation systems are core components in e-commerce, entertainment,
and content platforms.
🛠️ Data Engineering
1. ETL (Extract, Transform, Load)
Problem: Implement basic ETL processes to move and transform data between systems.
Importance: ETL is crucial for creating data pipelines and data integration in Data
Engineering.
2. Database Design
Problem: Design normalized databases and write SQL queries for data retrieval, joins,
subqueries, and aggregations.
Importance: Efficient database design and optimization are key in Data Engineering.
Importance: Streaming data is essential for real-time data analysis and systems like
recommendation engines or fraud detection.
5. Data Warehousing
Problem: Design and optimize a Data Warehouse and use tools like Google BigQuery,
Amazon Redshift, or Snowflake.
Importance: Data warehouses are essential for large-scale storage and analytical processing
of data.
6. NoSQL Databases
Problem: Work with MongoDB, Cassandra, or Redis for unstructured data storage.
Importance: NoSQL databases are essential for handling non-relational, large-scale, and
unstructured data.
7. Cloud Computing
Problem: Implement and manage data pipelines on AWS, GCP, or Azure.
Importance: Cloud computing platforms are integral for scalable and cost-efficient data
processing and storage.
8. Data Pipeline Automation
Problem: Automate data pipelines using Airflow, Luigi, or Prefect.
Importance: Automating data pipelines ensures efficient, repeatable, and reliable data
processing.
9. Data Sharding and Partitioning
Problem: Design strategies for sharding and partitioning large datasets across distributed
systems.
Importance: Crucial for scaling systems and improving the performance of large databases.
10. Caching
Conclusion
To succeed in your interview preparation, focusing on these areas will help you develop a broad and
deep understanding of both algorithmic problem-solving and practical applications in AI, Data
Science, and Data Engineering. Work on the coding problems listed above, understand their
implementations, and practice solving them in real interview settings.
If you'd like, I can help you dive deeper into any of these topics with example problems, solutions, or
additional explanations! Just let me know.
Great! Let's continue with more detailed examples for each of the key problem areas. I’ll focus on
providing detailed explanations, Python implementations, and time complexities for the problems
mentioned earlier.
1. Breadth-First Search (BFS)
python
from collections import deque

def bfs(graph, start):
    # Breadth-first traversal; returns the set of reachable nodes
    visited = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node not in visited:
            visited.add(node)
            for neighbor in graph[node]:
                queue.append(neighbor)
    return visited
# Example usage:
graph = {
'A': ['B', 'C'],
'B': ['A', 'D', 'E'],
'C': ['A', 'F'],
'D': ['B'],
'E': ['B', 'F'],
'F': ['C', 'E']
}
print(bfs(graph, 'A')) # {'A', 'B', 'C', 'D', 'E', 'F'}
Time Complexity: O(V + E) where V is the number of vertices and E is the number of edges.
Use Case: AI algorithms, particularly for pathfinding, scheduling, and graph traversal.
2. Dynamic Programming
python
def longest_common_subsequence(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = length of the LCS of s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
# Example usage:
print(longest_common_subsequence("abcde", "ace")) # 3
Time Complexity: O(m * n) where m and n are the lengths of the input strings.
Use Case: Used in AI for sequence alignment, text comparison, and other NLP tasks.
3. Backtracking
python
def solve_nqueens(n):
def is_safe(board, row, col):
for i in range(row):
if board[i] == col or board[i] - i == col - row or board[i] + i == col + row:
return False
return True
    def solve(board, row):
        # Place a queen in each row; record a solution when all rows are filled
        if row == n:
            solutions.append(board[:])
            return
        for col in range(n):
            if is_safe(board, row, col):
                board[row] = col
                solve(board, row + 1)
                board[row] = -1
solutions = []
solve([-1] * n, 0)
return solutions
# Example usage:
print(solve_nqueens(4)) # Multiple solutions
Time Complexity: O(N!) for the backtracking solution (due to recursion over all permutations).
Use Case: Useful in AI for constraint satisfaction problems, like game AI and puzzle-solving.
4. Greedy Algorithms
python
def fractional_knapsack(weights, values, capacity):
    # Greedy: take items in decreasing value-to-weight ratio; take a fraction of the last item if needed
    ratio = sorted(((v / w, w, v) for w, v in zip(weights, values)), reverse=True)
    total_value = 0
    for r, w, v in ratio:
        if capacity >= w:
            capacity -= w
            total_value += v
        else:
            total_value += r * capacity
            break
    return total_value
# Example usage:
weights = [10, 20, 30]
values = [60, 100, 120]
capacity = 50
print(fractional_knapsack(weights, values, capacity)) # 240.0
📊 Data Science
1. Data Cleaning and Transformation
python
import pandas as pd
# Create a sample dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, None, 30, 22],
'City': ['New York', None, 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Fill missing values (numeric columns with the mean, text columns with a placeholder)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['City'] = df['City'].fillna('Unknown')
print(df)
Time Complexity: O(n), where n is the number of rows (for filling missing data).
Use Case: Essential for preprocessing data before training machine learning models.
2. Regression Problems
python
import numpy as np

def linear_regression(X, y):
    # One common approach (assumed here): the normal equation theta = (X^T X)^-1 X^T y
    X_b = np.c_[np.ones((X.shape[0], 1)), X]  # prepend a bias column
    return np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
# Example usage:
X = np.array([[1], [2], [3], [4]]) # Feature (e.g., hours studied)
y = np.array([1, 2, 3, 4]) # Target (e.g., scores)
theta = linear_regression(X, y)
print(theta)
Time Complexity: O(n²) for matrix inversion (due to dot product operations).
Use Case: Regression is a core concept in data science for predictive modeling.
3. Clustering
python
from sklearn.cluster import KMeans
import numpy as np

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# KMeans Clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_) # Labels of each point
print(kmeans.cluster_centers_) # Cluster centers
Time Complexity: O(n * k * i) where n is the number of points, k is the number of clusters, and i is
the number of iterations.
Use Case: Clustering is a popular technique in data science for customer segmentation, anomaly
detection, and unsupervised learning tasks.
🛠️ Data Engineering
1. ETL (Extract, Transform, Load)
python
import pandas as pd
# Extract
data = pd.read_csv('input_data.csv')
# Transform
data['Age'] = data['Age'].fillna(data['Age'].mean()) # Fill missing values
# Load
data.to_csv('transformed_data.csv', index=False)
Time Complexity: Depends on the I/O operations and data processing steps.
Use Case: Common in building data pipelines to integrate and transform data for analysis or
machine learning models.
2. SQL Queries
sql
Time Complexity: O(n * m) where n and m are the number of rows in the tables being joined.
Use Case: SQL joins are essential for combining data from multiple tables, which is a common task
in data engineering for building data warehouses.
3. Data Streaming
python
from kafka import KafkaProducer  # kafka-python client

# Initialize producer
producer = KafkaProducer(bootstrap_servers='localhost:9092')
# Send message to Kafka topic
producer.send('my_topic', b'Hello, Kafka!')
producer.flush()
Conclusion
By working through these problems, you'll build a strong foundation across various fields like AI, Data
Science, and Data Engineering. The key areas of focus are:
Data Science: Techniques for data manipulation, statistical analysis, and machine learning model
implementation.
Data Engineering: Core competencies in building efficient data pipelines, querying databases, and
handling large-scale data systems.
These skills will help you not only pass coding interviews but also work efficiently in real-world data
engineering, AI, and data science environments. Would you like further clarification or more example
problems in any of these areas?