Report: Zazmic Inc. Senior/Middle+ Data Engineer Hiring Test (AWS, Snowflake, Databricks, Python, SQL)
Candidate: Siddhesh Kalgaonkar
PDF generated at: 13 May 2024 08:18:27 UTC
Score
75.2% (225 / 300)
Scored in Zazmic Inc. Senior/Middle+ Data Engineer Hiring Test (AWS, Snowflake, Databricks, Python, SQL) in 40 min 2 sec on 13 May 2024 10:36:26 EEST.
Candidate Information
Email: [email protected]
Test: Zazmic Inc. Senior/Middle+ Data Engineer Hiring Test (AWS, Snowflake, Databricks, Python, SQL)
Invited by: Stanislav
Skill Distribution
No.  Skill       Level         Score
1    SQL         Basic         100%
2    Databricks  Basic         0%
3    Snowflake   -             50%
4    Databricks  Intermediate  50%
5    AWS         Basic         100%
6    AWS         Advanced      0%
7    Python      Basic         10%
8    Python      Intermediate  95%
9    SQL         Intermediate  100%
Tags Distribution
SQL: 100%  |  Easy: 25%  |  Spark: 0%  |  Jobs: 0%  |  Union: 100%
Questions
No.  Question                            Type             Skill                   Score  Time Taken
4    Performance Tuning                  Multiple Choice  Snowflake               5/5    21 sec
7    Data Warehousing                    Multiple Choice  Snowflake               0/5    1 min 18 sec
10   AWS: Optimize Cloud Storage         Multiple Choice  AWS (Advanced)          0/5    1 min 11 sec
13   Querying Data                       Multiple Choice  Snowflake               0/5    1 min 42 sec
16   Python: Fancy tuple                 Coding           Python (Intermediate)   75/75  3 min 37 sec
17   Stop Words (hackerrank)             Coding           Python (Basic)          4/50   13 min 56 sec
18   Counting Votes (hackerrank)         DbRank           SQL (Intermediate)      50/50  3 min 32 sec
19   Count of Blood Groups (hackerrank)  DbRank           SQL (Intermediate)      50/50  6 min 42 sec
Multiple Choice: Databricks, Databricks Workflow, Data Engineering, Data Visualization, Data Analysis (Easy)
Question description
User A works as a data engineer at a fintech startup. The compliance department has introduced a policy to hide
personal information in order to meet GDPR (General Data Protection Regulation) requirements.
Which of the following approaches is most appropriate to ensure data privacy and security while processing sensitive
data in Databricks clusters, so that the compliance requirements are met?
Interviewer guidelines
Fine-grained access controls make it possible to restrict access to sensitive data at a granular level. This ensures that
only authorized individuals or processes can reach the sensitive data within the Databricks environment. User A can
then enforce data privacy and security by granting permissions only to the necessary personnel and preventing
unauthorized access to sensitive information.
Candidate's Solution
No comments.
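For illustration, a minimal SQL sketch of one common fine-grained access pattern in Databricks: hide the PII columns behind a view and grant access only to the view. The customers table, its columns, and the analysts group are all hypothetical.

-- Hypothetical base table containing PII (e.g. name, email).
-- Expose only non-sensitive columns through a view.
CREATE VIEW customers_masked AS
SELECT customer_id, country, account_tier
FROM customers;

-- Analysts may read the masked view but not the base table.
GRANT SELECT ON VIEW customers_masked TO `analysts`;
REVOKE ALL PRIVILEGES ON TABLE customers FROM `analysts`;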
3. In SQL, what's the difference between an inner join and an outer join? Correct
Multiple Choice
Question description
In SQL, what's the difference between an inner join and an outer join?
Candidate's Solution
Inner join includes only rows present in either table but not present in both tables. Outer join …
Inner join includes only rows present in both tables. Outer join includes all rows present in either
Inner join includes only rows present in both tables. Outer join includes all rows from the first table
excluding the rows from the second table.
Inner join includes only rows present in both tables. Outer join includes all rows from both tables.
No comments.
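A toy example of the accepted answer. Suppose a hypothetical employees(name, dept_id) table holds ('Ann', 10) and ('Bob', 20), and departments(dept_id, dept_name) holds (10, 'Sales') and (30, 'HR'):

-- INNER JOIN: only rows with a match in both tables.
-- Result: ('Ann', 'Sales')
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- FULL OUTER JOIN: all rows from both tables; the side without
-- a match is filled with NULLs.
-- Result: ('Ann', 'Sales'), ('Bob', NULL), (NULL, 'HR')
SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;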
Question description
Which of the following Snowflake features can be used to boost the efficiency of complex queries with several joins and
subqueries?
Interviewer guidelines
Materialized views are precomputed views that allow quicker access to data by storing the results of a query in a
table-like format. By generating precomputed views of the data used in the queries, materialized views can
enhance the efficiency of complex queries with several joins and subqueries. This reduces the need to process
the same data repeatedly and improves query performance.
Candidate's Solution
external functions, which allow users to run custom code outside of Snowflake and integrate the …
data sharing, which allows for the sharing of data between different Snowflake accounts
materialized views, which create precomputed views of complex queries for faster access to data
No comments.
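As a sketch of the accepted answer: a materialized view definition in Snowflake may reference only a single table, so the usual pattern is to materialize an expensive aggregation and let the complex joins run against the precomputed result. The table and column names below are hypothetical.

-- Precompute a heavy aggregation; Snowflake maintains the
-- materialized view automatically as the base table changes.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, product_id, SUM(amount) AS total_amount
FROM sales
GROUP BY order_date, product_id;

-- Complex queries now join against the precomputed result
-- instead of re-aggregating the raw sales table each time.
SELECT p.category, SUM(m.total_amount) AS revenue
FROM daily_sales_mv m
JOIN products p ON p.product_id = m.product_id
GROUP BY p.category;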
Multiple Choice: Databricks, Azure Data Factory, Data Engineering, Azure Datalake Storage (Medium)
Question description
A data engineer needs to ingest data into Databricks every five minutes for the next sixty days. The engineer has
decided to use the advanced feature Auto Loader instead of COPY INTO.
Interviewer guidelines
Auto Loader is specifically designed for structured streaming workloads in Databricks. It is optimized to handle
continuous and incremental data ingestion, making it well-suited for scenarios where data needs to be ingested every
five minutes for a long duration. By leveraging structured streaming capabilities, Auto Loader processes the incoming
data in multiple batches, allowing for efficient loading and handling of re-uploaded data. This means that if a portion
of the data has already been ingested in a previous batch, Auto Loader will only process and load the new or modified
data, minimizing redundant processing and improving performance.
Candidate's Solution
Autoloader is made for structured streaming. Also, it splits the processing into multiple batches.
Autoloader is expensive for large data but provides better primitives around schema inference and
evolution.
Autoloader provides the additional layer of security which supersedes COPY INTO.
Auto Loader provides an additional protective layer, but COPY INTO does not.
No comments.
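For reference, Auto Loader is normally configured from PySpark via the cloudFiles source, but it is also exposed to SQL through the cloud_files() function in Delta Live Tables. A minimal DLT SQL sketch, assuming a hypothetical landing path and JSON input:

-- Incrementally ingest new files as they arrive; files already
-- processed in earlier batches are tracked and skipped.
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT *
FROM cloud_files(
  "/mnt/landing/events",
  "json",
  map("cloudFiles.inferColumnTypes", "true")
);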
6. What data structure is often referred to as "Last in, first out"? Correct
Multiple Choice
Question description
Candidate's Solution
Stack
Queue
List
Set
No comments.
Question description
A healthcare organization that has recently migrated its data warehouse to Snowflake wants to design the schema
for storing patient records including patient demographics, medical history, and medication orders.
Which of the following Snowflake schema design options is most suitable in this situation?
Interviewer guidelines
Snowflake does not require primary keys, although declaring them is still good practice for documenting data
structure. In this instance, patient_id is the ideal primary key because it is a unique identifier for every patient.
The ideal schema design is the option that uses patient_id alone as the primary key, without any extra indexing
or partitioning that could slow down queries.
Candidate's Solution
VARCHAR(50), last_name VARCHAR(50), dob DATE, sex VARCHAR(10), height FLOAT, weight FLOAT,
blood_type VARCHAR(10), smoker BOOLEAN, drinker BOOLEAN PRIMARY KEY (patient_id) );</code>
</pre>
VARCHAR(50), last_name VARCHAR(50), dob DATE, sex VARCHAR(10), height FLOAT, weight FLOAT,
blood_type VARCHAR(10), smoker BOOLEAN, drinker BOOLEAN PRIMARY KEY (patient_id, dob) );
</code></pre>
VARCHAR(50), last_name VARCHAR(50), dob DATE, sex VARCHAR(10), height FLOAT, weight FLOAT,
blood_type VARCHAR(10), smoker BOOLEAN, drinker BOOLEAN PRIMARY KEY (dob, patient_id) );
</code></pre>
No comments.
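Written out in full, the recommended design might look like the following sketch; the table name and the patient_id column definition (absent from the truncated options above) are assumptions.

CREATE TABLE patients (
    patient_id  INT,          -- unique per patient: the natural primary key
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    dob         DATE,
    sex         VARCHAR(10),
    height      FLOAT,
    weight      FLOAT,
    blood_type  VARCHAR(10),
    smoker      BOOLEAN,
    drinker     BOOLEAN,
    PRIMARY KEY (patient_id)  -- declared for documentation; Snowflake does not enforce it
);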
Question description
This diagram shows how a typical user logs into AWS using IAM credentials and accesses various services using roles
and group permissions.
Candidate's Solution
Groups
Roles
Users
All of these
No comments.
9. At the command line, what command sends the contents of a file to standard out? Correct
Multiple Choice
Question description
At the command line, what command sends the contents of a file to standard out?
Candidate's Solution
less
grep
cat
echo
No comments.
Question description
A video streaming service aims to expand its presence globally. With a vast and growing volume of high-definition
videos and the demand for low-latency streaming, the company's IT team seeks to optimize its cloud storage solution.
They prioritize rapid content delivery, durability of data, cost-effectiveness, and the capability of analytics and machine
learning operations on this data. Which storage configuration best meets these requirements?
Interviewer guidelines
Implementing the Amazon S3 Standard storage class ensures quick access to video content, and integrating with
Amazon Redshift Spectrum offers efficient analytics directly on the S3 data, making it suitable for future machine
learning operations. In Option 1, the use of Kinesis Video Streams focuses more on real-time video analytics; while
analytics is part of the requirements, the scenario also calls for future machine learning operations on the
stored data.
Candidate's Solution
Use Amazon S3 Standard storage class for the video content. Implement Amazon Kinesis Video …
Store video content in Amazon EFS (Elastic File System) with provisioned throughput mode. Use AWS …
Implement Amazon S3 Standard storage class for quick access and integrate with Amazon Redshift
Spectrum for analytics and machine learning operations.
Use Amazon S3 Standard storage class with Amazon CloudFront for low-latency content delivery.
No comments.
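For illustration, a minimal Redshift Spectrum sketch of the accepted answer; the schema name, Glue catalog database, IAM role ARN, and table are all hypothetical:

-- Register S3-backed data as an external schema (Redshift Spectrum),
-- then query it in place without loading it into the cluster.
CREATE EXTERNAL SCHEMA video_lake
FROM DATA CATALOG
DATABASE 'video_catalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

SELECT title, COUNT(*) AS views
FROM video_lake.playback_events
GROUP BY title
ORDER BY views DESC
LIMIT 10;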
11. What HTTP request method/methods should not affect the state on the server? Correct
Multiple Choice
Question description
What HTTP request method/methods should not affect the state on the server?
Candidate's Solution
GET
POST
PUT
PATCH
No comments.
Question description
A data analysis project involves two tables: inventory and orders. The inventory table has the following columns:
item_no, quantity, and name.
Table 1: inventory
item_no  quantity  name
2471     43        Television
3478     54        Refrigerator
7265     89        Laptop
6370     20        e-reader
The orders table has the following columns: order_id, customer_id, order_item, and others.
Table 2: orders
Query:
with recursive orderdb as (
    select i.item_no, i.quantity, o.cust_id, o.order_id
    from inventory i
    join orders o on i.name = o.order_item
)
select * from orderdb
where quantity = (select max(quantity) from orderdb)
Interviewer guidelines
The query starts by creating a recursive common table expression (CTE) named "orderdb". The CTE uses a JOIN
operation to combine the 'inventory' and 'orders' tables, and it returns the item number, quantity, customer_id, and
order_id for each order item.
Candidate's Solution
No comments.
Question description
A retail company using Snowflake for data warehousing wants to analyze the sales performance of its different
product categories across multiple stores.
Which of the following SQL queries is the most efficient and accurate way to retrieve the total sales revenue for each
product category across all stores?
Interviewer guidelines
The query first selects all distinct store IDs and then retrieves the total sales revenue for each product category
across those stores. This avoids grouping by individual store IDs, which could lead to duplicate calculations.
Candidate's Solution
No comments.
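A sketch of the approach the guidelines describe, using a hypothetical sales table and columns:

-- Total revenue per product category across all stores.
-- No store_id in the GROUP BY, so each category is computed once.
SELECT product_category,
       SUM(sales_amount) AS total_revenue
FROM sales
GROUP BY product_category;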
Multiple Choice: Spark, Databricks, Jobs, Azure Datalake, Azure Workflow (Medium)
Question description
User A is responsible for managing and monitoring 12 production jobs that transfer data to Azure Data Lake (ADL)
for the data science team at a large e-commerce store. One day, User A noticed that Job 6 took much
longer than expected, resulting in delays for subsequent jobs and disrupting the team's workflow.
What should User A do to prevent such issues from occurring in the future?
Interviewer guidelines
In order to avoid delays in subsequent jobs due to slow or stalled jobs, User A should create new shared or task-
scoped clusters. These clusters ensure that each job runs in a completely isolated environment, preventing resource
contention and allowing jobs to run independently. Option 1, terminating the current job and manually running the
next job, is not a scalable solution and may result in unnecessary downtime. Option 2, including data quality checks
before each job, is important but does not directly address the issue of job delays. Option 4, increasing the number of
worker nodes, may be helpful but is not as effective as creating new clusters for each job. Therefore, the correct
answer is option 3.
Candidate's Solution
Immediately terminate the current job and run the next job manually.
Include data quality checks before each job to prevent such problems.
Create new shared or task-scoped clusters to ensure each job runs in an isolated environment.
No comments.
15. Which of the following statements are true about Python 3 vs Python 2? Partially correct
Question description
Which of the following statements are true about Python 3 vs Python 2? (More than one)
Candidate's Solution
No comments.
Question description
Your implementations of the class will be tested by a provided code stub on several input files. Each input file
contains parameters to test your implementation with. First, the provided code stub initializes the instance of
FancyTuple. Next, it tests the implementation by accessing its elements and checking its length. The result of their
executions will be printed to the standard output by the provided code.
Input from stdin will be processed as follows and passed to the function.
The first line has the integer n, the number of elements in the tuple.
The next n lines contain the elements of the tuple.
The following line has the integer q, the number of operations to be performed on a FancyTuple instance.
The next q lines contain the standalone operations to be performed on the FancyTuple.
SAMPLE CASE 0
Sample Input 0
STDIN Function
----- --------
3 → n=3
dog → first item
cat → second item
mouse → third item
6 → q=6
first → first function...
second
third
fourth
fifth
len
Sample Output 0
dog
cat
mouse
AttributeError
AttributeError
3
Explanation 0
The code initializes t = FancyTuple("dog", "cat", "mouse"). Then, there are 6 operations to be performed:
1. t.first returns "dog" because "dog" is the first element of the tuple.
2. t.second returns "cat" because "cat" is the second element of the tuple.
3. t.third returns "mouse" because "mouse" is the third element of the tuple.
4. t.fourth raises the AttributeError exception because the tuple has 3 elements.
5. t.fifth raises the AttributeError exception because the tuple has 3 elements.
6. len(t) returns 3 because the tuple has 3 elements.
Interviewer guidelines
Setter's solution:

from collections import namedtuple


class FancyTuple:
    def __init__(self, *args):
        # The export omitted the constructor; a namedtuple-based one
        # consistent with the properties below is assumed. Only as many
        # ordinal fields as there are elements exist on the namedtuple,
        # so accessing a missing field raises AttributeError.
        names = ("first", "second", "third", "fourth", "fifth")
        self.values = namedtuple("Values", names[:len(args)])(*args)

    @property
    def first(self):
        return self.values.first

    @property
    def second(self):
        return self.values.second

    @property
    def third(self):
        return self.values.third

    @property
    def fourth(self):
        return self.values.fourth

    @property
    def fifth(self):
        return self.values.fifth

    def __len__(self):
        return len(self.values)
#!/bin/python3

import math
import os
import random
import re
import sys


class FancyTuple:
    def __init__(self, *args):
        self._args = args

    def __getattr__(self, name):
        if name in ("first", "second", "third", "fourth", "fifth"):
            try:
                return self._args[{"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}[name]]
            except IndexError:
                raise AttributeError(name)
        raise AttributeError(name)

    def __len__(self):
        return len(self._args)


if __name__ == '__main__':
    fptr = open(os.environ['OUTPUT_PATH'], 'w')

    n = int(input())
    items = [input() for _ in range(n)]

    t = FancyTuple(*items)

    q = int(input())
    for _ in range(q):
        command = input()
        if command == "len":
            fptr.write(str(len(t)) + "\n")
        else:
            try:
                elem = getattr(t, command)
            except AttributeError:
                fptr.write("AttributeError\n")
            else:
                fptr.write(elem + "\n")
    fptr.close()
No comments.
Question description
In NLP, stop words are commonly used words like "a", "is", and "the". They are typically filtered out during processing.
Implement a function that takes a string text and an integer k, and returns the list of words that occur in the text at
least k times. The words must be returned in the order of their first occurrence in the text.
Example
text = "a mouse is smaller than a dog but a dog is stronger"
k=2
The list of stop words that occur at least k = 2 times is ["a", "is", "dog"]. "a" occurs 3 times, "is" and "dog" both occur 2
times. No other word occurs at least 2 times. The answer is in order of first appearance in text.
Function Description
Complete the function stop_words in the editor below.
Returns
string[]: the list of stop words in order of their first occurrence
Constraints
text has at most 50000 characters.
Every character in text is either an English lowercase letter or a space.
text starts and ends with a letter. No two consecutive characters are spaces, i.e. text is a valid sentence.
There will be at least one stop word in the text.
1 ≤ k ≤ 18
Input from stdin will be processed as follows and passed to the function.
SAMPLE CASE 0
Sample Input 0
STDIN Function
----- --------
the brown fox jumps over the brown dog and runs away to a brown house → text
2 → k
Sample Output 0
the
brown
Explanation 0
"the" occurs 2 times and "brown" occurs 3 times. These words are returned in the order of their first occurrence in
the text.
SAMPLE CASE 1
Sample Input 1
Sample Output 1
foo
Explanation 1
"foo" occurs 3 times.
Interviewer guidelines
Setter's solution:

#!/bin/python3

import math
import os
import random
import re
import sys


#
# Complete the 'stop_words' function below.
#
# The function is expected to return a STRING_ARRAY.
# The function accepts following parameters:
#  1. STRING text
#  2. INTEGER k
#

def stop_words(text, k):
    # Count every occurrence of every word; dicts preserve insertion
    # order (Python 3.7+), so keys are in first-occurrence order.
    word_counts = {}
    for word in text.split():
        word_counts[word] = word_counts.get(word, 0) + 1

    # Keep the words that occur at least k times, in order of
    # first occurrence.
    return [word for word, count in word_counts.items() if count >= k]


if __name__ == '__main__':
    fptr = open(os.environ['OUTPUT_PATH'], 'w')

    text = input()

    k = int(input().strip())

    result = stop_words(text, k)

    fptr.write('\n'.join(result))
    fptr.write('\n')

    fptr.close()
No comments.
DbRank: Simple Joins, Interviewer Guidelines, SQL, Database, Hash Map, Sub-Queries (Medium)
Question description
Given a database of votes won by different candidates in an election, find the number of votes won by female
candidates whose age is less than 50.
SCHEMA
Candidates
Results

Candidates (id, gender, age, party):
1  M  55  Democratic
2  M  51  Democratic
3  F  49  Democratic
4  M  60  Republic
5  F  61  Republic
6  F  48  Republic

Results (the last two columns are candidate_id and votes):
1  1  847529
1  4  283409
2  2  293841
2  5  394385
3  3  429084
3  6  303890

Expected Output:
732974
Explanation:
There are three female candidates contesting the election. Two of them are less than 50 years old. The sum of
their votes is 429084 + 303890 = 732974.
Interviewer guidelines
SOLUTION
MySQL solution:

SELECT SUM(votes)
FROM (SELECT id
      FROM candidates
      WHERE gender = 'F' AND age < 50) AS r
JOIN results res
  ON r.id = res.candidate_id;
No comments.
DbRank: Interviewer Guidelines, SQL, Database, Hash Map, Sub-Queries, Union (Medium)
Question description
A blood bank maintains two tables - DONOR, with information about the people who are willing to donate blood and
ACCEPTOR, with information about the people who are in need of blood. The bank wants to know the number of
males and the number of females with a particular blood group.
TABLE SCHEMA
DONOR
ACCEPTOR
SAMPLE CASE 0
DONOR
5 MICHAEL M Warne, NH A+ 1
ACCEPTOR
1 LINDA F Warne, NH A+ 9
5 PATRICIA F Warne, NH A+ 5
Sample Output
F A+ 3
F AB+ 3
M A+ 1
M A- 1
M AB+ 2
Explanation
There are 3 females with A+ blood group.
Similarly, there are 3 females with AB+ blood group. And so on.
Interviewer guidelines
SOLUTION
MySQL solution:

SELECT gender, bg, COUNT(*)
FROM (SELECT gender, bg FROM donor
      UNION ALL
      SELECT gender, bg FROM acceptor) a
GROUP BY gender, bg
ORDER BY gender, bg;
No comments.