DataGrokr Technical Assignment
Thank you for your interest in the Data Engineering Internship at DataGrokr.
We anticipate that the selected candidates will work on Data Engineering and Cloud-related
projects, so this assignment is designed to test candidates' skills in those areas.
Candidates who are already proficient in SQL, Python, and Spark will have an edge, but even
if you know nothing about these technologies you should be able to complete the assignment
by following the instructions and studying the links provided.
Please note that the ability to learn new technologies and follow instructions is a key
skill required in your day-to-day job at DataGrokr.
The objective of the assignment is to test your proficiency in querying and data
analysis. The assignment has 3 parts.
● Section 2: Analyze the given dataset and answer the questions using either
SparkSQL or Spark (PySpark) on the Databricks cluster provisioned in Step 1.
Notes: Databricks notebooks support mixing languages. You can override the
default language by specifying the language magic command %<language> at
the beginning of a cell. The supported magic commands are %python, %sql,
%sh, and %md. Read https://docs.databricks.com/notebooks/index.html
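For illustration only, here is a minimal sketch of two notebook cells, assuming the notebook's default language is Python. The file path and the view name ipl_matches are placeholders, not names from the provided dataset:

# Cell 1 -- default language (Python): read the file and register a temporary view
matches_df = spark.read.csv("/FileStore/tables/matches.csv", header=True, inferSchema=True)
matches_df.createOrReplaceTempView("ipl_matches")

%sql
-- Cell 2 -- the %sql magic switches this cell to SQL; it queries the view registered above
SELECT COUNT(*) AS total_matches FROM ipl_matches;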
1. Use SparkSQL or PySpark to analyze the data and answer the following questions.
A. Find the top 3 venues that have hosted the most eliminator matches.
B. Return the most catches taken by a player in IPL history.
C. Write a query to return a report for the highest wicket-taker in matches
that were affected by the Duckworth-Lewis (D/L) method.
D. Write a query to return a report for the highest strike rate by a batsman in
non-powerplay overs (overs 7-20).
Note: strike rate = (total runs scored / total balls faced by the player) * 100. Make sure
that the balls faced are legal deliveries (not wide balls or no balls). A hedged SparkSQL
sketch for this question appears after this list.
E. Write a query to return a report for the highest extra runs at a venue (stadium, city).
F. Write a query to return a report for the cricketers with the most Player of the Match
awards at neutral venues.
G. Write a query to get a list of the top 10 players with the highest batting average.
Note: batting average is the total number of runs scored divided by the number of times
the player has been out (make sure to include run outs at the non-striker's end as valid
dismissals while calculating the average).
H. Write a query to find out who has officiated (as an umpire) in the most
matches in IPL.
I. Find venue details of the match where V Kohli scored his highest individual runs in
IPL.
J. Creative case study:
Please analyze how winning or losing the toss can impact a match and its result.
(Bonus for visualization here.)
2. Please refer to the schema reference Excel sheet provided in the zipped folder to
understand the relationships between the tables. Please write the queries in the same notebook.
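As an illustration of the expected style of analysis, below is a minimal SparkSQL sketch for question D (highest strike rate in non-powerplay overs). The view name deliveries and the column names batsman, batsman_runs, wide_runs, noball_runs, and over are assumptions based on common IPL ball-by-ball datasets; use the names given in the schema reference sheet. The other questions can be answered with similar aggregate queries.

# Hypothetical view/column names -- replace them with the ones in the schema reference sheet
strike_rate_df = spark.sql("""
    SELECT batsman,
           SUM(batsman_runs)                              AS total_runs,
           COUNT(*)                                       AS balls_faced,
           ROUND(SUM(batsman_runs) * 100.0 / COUNT(*), 2) AS strike_rate
    FROM deliveries
    WHERE over BETWEEN 7 AND 20              -- non-powerplay overs
      AND wide_runs = 0 AND noball_runs = 0  -- legal deliveries only
    GROUP BY batsman
    ORDER BY strike_rate DESC
""")
strike_rate_df.show(10)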
Section 3: Expose Data.
1. Create a database (use any relational DB, preferably SQLite) and load the data from the
dataset (Section 1.1) into the database.
2. Create a Database class. The class will need to have:
1. A constructor to initialize the DB connection and other variables.
2. Methods implemented to return the result set for each query in Section 2.1;
the result set should be returned as a JSON/dict object.
3. Exception handling.
4. A get_status method to ping database connectivity.
(A minimal sketch of such a class appears after this list.)
3. Feel free to create any additional classes or data structures you deem necessary.
4. Evaluation: here is an input example:
from database import Database
db = Database()
qry1_result = db.get_query1_result()
5. Please follow industry standards while writing the code and include basic schema and
data validations. [Reference 1.2]
6. Preferred Programming language –
● Python
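To make the expectations for Section 3 concrete, here is a minimal sketch of the Database class, assuming SQLite via Python's built-in sqlite3 module. The file name ipl.db and the table/column names in the example query (matches, venue, eliminator) are placeholders and must be adapted to your own schema.

import sqlite3


class Database:
    """Thin wrapper around the SQLite database holding the IPL tables."""

    def __init__(self, db_path="ipl.db"):
        # Constructor: initialize the DB connection and other variables
        self.db_path = db_path
        self.connection = sqlite3.connect(db_path)
        self.connection.row_factory = sqlite3.Row  # rows can be converted to dicts

    def get_status(self):
        # Ping database connectivity with a trivial query
        try:
            self.connection.execute("SELECT 1")
            return {"status": "up"}
        except sqlite3.Error as exc:
            return {"status": "down", "error": str(exc)}

    def get_query1_result(self):
        # One method per query in Section 2.1; the table/column names here are assumptions
        query = """
            SELECT venue, COUNT(*) AS eliminator_matches
            FROM matches
            WHERE eliminator = 'Y'
            GROUP BY venue
            ORDER BY eliminator_matches DESC
            LIMIT 3
        """
        try:
            rows = self.connection.execute(query).fetchall()
            return [dict(row) for row in rows]  # result set as a JSON-serializable list of dicts
        except sqlite3.Error as exc:
            return {"error": str(exc)}

Loading the CSV files into SQLite can be done, for example, with pandas.DataFrame.to_sql before the class is used, and each remaining question from Section 2.1 gets its own get_queryN_result method following the same pattern.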
Deliverables:
b. Your code will be evaluated not just on the final answers but on code
quality and unit tests.
Good luck and we hope you learn something new in this process!