
Assessment for Data Engineering Internship at DataGrokr

Thank you for your interest in the Data Engineering Internship at DataGrokr.

We anticipate that selected candidates will work on Data Engineering and Cloud related
projects, so this assignment tests your skills in those areas. Candidates who are already
proficient in SQL, Python, and Spark will have an edge, but even if you know nothing
about these technologies, you should be able to complete the assignment by following
the instructions and studying the links provided.

Please note that the ability to learn new technologies and follow instructions is a
key skill in your day-to-day job at DataGrokr.

What you need to do:

The objective of the assignment is to test your proficiency in querying and data
analysis. The assignment has 3 parts.

● Section 1: You will provision an environment in Databricks (a provider of
managed Spark as a Service on the Cloud) and load some data sets.

● Section 2: Analyze the given dataset and answer the questions using either
SparkSQL or PySpark on the Databricks cluster provisioned in Section 1.

● Section 3: Create a class to expose the result sets generated in Section 2 to
data consumers.

Section 1: Environment setup and data loading.

1. Go to Databricks community cloud and create a free account to get access to a
single node Spark cluster:
a. Go to https://databricks.com/try-databricks and select the
“COMMUNITY EDITION” (click the “get started” button).
b. Provide all details and sign up.
c. Verify your email ID and set a password.
d. You can watch the linked video on YouTube to get started with Databricks
community cloud.
2. Load the IPL dataset into the cluster:
a. Download the files needed to complete this assignment from here. Load the
dataset into the cluster. This data set contains data on IPL matches.
b. We have sampled down the files and created a zipped file. You can
download the zipped file and extract the files.
c. Once you have downloaded the files, create a new notebook and
import the data into the cluster.
d. Create data frames for each of the data sets. Give proper column
names and datatypes (refer to the schema provided with the data).
e. Register those dataframes as tables.
f. Please use PySpark or SparkSQL (a loading sketch follows the notes
below).
g. If you are new to Spark, refer to the Databricks and Spark documentation
to learn about notebooks, dataframes, loading data, etc. Below are some
links that may be useful for learning Spark:
i. https://docs.databricks.com/getting-started/spark/index.html
ii. https://spark.apache.org/docs/latest/api/python/

Notes: Databricks notebooks support mixing languages. You can override the
default language by specifying the language magic command %<language> at
the beginning of a cell. The supported magic commands include %python, %sql,
%sh, and %md. Read https://docs.databricks.com/notebooks/index.html
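
Here is a minimal loading sketch for step 2. The file name matches.csv, the upload
path, and the column names are assumptions for illustration; match them to the actual
files and the schema sheet shipped with the zip.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# In a Databricks notebook, a SparkSession named `spark` is available by default.
matches_schema = StructType([
    StructField("match_id", IntegerType(), True),
    StructField("venue", StringType(), True),
    StructField("city", StringType(), True),
    # ... remaining columns per the provided schema sheet
])

matches_df = (spark.read
              .option("header", "true")
              .schema(matches_schema)
              .csv("/FileStore/tables/matches.csv"))  # assumed upload path

# Register the dataframe as a table so SparkSQL queries can reference it.
matches_df.createOrReplaceTempView("matches")

The same pattern applies to each file in the dataset.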

Section 2: Using SparkSQL or pySpark for data analysis.

1. Use SparkSQL or PySpark to analyze the data and answer the following questions.

A. Find the top 3 venues that hosted the most eliminator matches.
B. Return the most catches taken by a player in IPL history.
C. Write a query to return a report for the highest wicket-taker in matches
that were affected by the Duckworth-Lewis (D/L) method.
D. Write a query to return a report for the highest strike rate by a batsman in
non-powerplay overs (overs 7-20).
Note: strike rate = (total runs scored / total balls faced by the player) * 100. Make
sure that the balls faced are legal deliveries (not wide balls or no balls).
E. Write a query to return a report for the highest extra runs at a venue
(stadium, city).
F. Write a query to return a report for the cricketers with the most Player of the
Match awards at neutral venues.
G. Write a query to get a list of the top 10 players with the highest batting average.
Note: batting average is the total number of runs scored divided by the number of
times the player has been out (make sure to include run outs at the non-striker's
end as valid outs while calculating the average).
H. Write a query to find out who has officiated (as an umpire) the most
matches in the IPL.
I. Find the venue details of the match where V Kohli scored his highest individual
runs in the IPL.
J. Creative case study:
Please analyze how winning or losing the toss can impact a match and its result.
(Bonus for visualization here.)

2. Please refer to the schema reference Excel sheet provided in the zipped file to
understand the relationships between the tables. Please write the queries in the same
notebook. A sample sketch for question A follows.
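
As a sketch of the expected working style, here is how question A might be answered
from Python. The matches table and the venue and eliminator columns are assumptions
based on common IPL schemas; match them to the provided schema sheet.

# Question A sketch: top 3 venues by number of eliminator matches hosted.
top_venues = spark.sql("""
    SELECT venue, COUNT(*) AS eliminator_matches
    FROM matches
    WHERE eliminator = 'Y'
    GROUP BY venue
    ORDER BY eliminator_matches DESC
    LIMIT 3
""")
top_venues.show()

Equivalently, the same query can be written in a %sql cell or with PySpark dataframe
operations.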
Section 3: Expose Data.
1. Create a database (use any relational DB, preferably SQLite) and load the data from
the dataset (Section 1.1) into the DB.
2. Create a class Database (see the sketch after this list). The class will need to have:
1. A constructor to initialize the DB connection and other variables.
2. Methods implemented to return the result set for each query in Section 2.1;
each result set should be returned as a JSON/dict object.
3. Exception handling.
4. A get_status method to ping database connectivity.
3. Feel free to create any additional classes or data structures you deem necessary.
4. Evaluation: here is an input example:
from database import Database
db = Database()
qry1_result = db.get_query1_result()
5. Please follow industry standards while writing the code and include basic schema and
data validations. [Reference 1.2]
6. Preferred programming language:
● Python
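
A minimal sketch of such a class, assuming SQLite with a local file named ipl.db; the
file name, the placeholder SQL, and the _run helper are illustrative assumptions, not a
required design.

import sqlite3


class Database:
    """Sketch only: assumes a local SQLite file and placeholder queries."""

    def __init__(self, db_path="ipl.db"):
        # Initialize the DB connection; rows come back as dict-like objects.
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row

    def get_status(self):
        # Ping database connectivity with a trivial query.
        try:
            self.conn.execute("SELECT 1")
            return {"status": "ok"}
        except sqlite3.Error as exc:
            return {"status": "error", "detail": str(exc)}

    def _run(self, sql):
        # Shared helper: execute a query and return rows as a list of dicts.
        try:
            cursor = self.conn.execute(sql)
            return [dict(row) for row in cursor.fetchall()]
        except sqlite3.Error as exc:
            raise RuntimeError(f"Query failed: {exc}") from exc

    def get_query1_result(self):
        # Placeholder SQL; substitute your Section 2 answer for question A.
        return self._run(
            "SELECT venue, COUNT(*) AS eliminator_matches FROM matches "
            "WHERE eliminator = 'Y' GROUP BY venue "
            "ORDER BY eliminator_matches DESC LIMIT 3"
        )

One method per Section 2 question would follow the same pattern as get_query1_result.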

Deliverables:

a. For Section 1, Section 2 and Section 3: a single Databricks notebook
where you have developed the code for Section 1, Section 2 and
Section 3. Download the files and email them to us. We will run the
Databricks notebook on our end and evaluate your submission.


b. Your code will be evaluated not just on the final answers but also on code
quality and unit tests:

• Follow coding standards (PEP-8)

• Appropriate error/exception handling

• Modular function design

c. Your final submission should be sent to [email protected]. Your
submission is due by end of day <01/02/2022>, and the subject should
follow this pattern: <Source, e.g. Internshala, college name> Data Engineer:
<Your Full Name>

d. Your up-to-date resume, named Firstname_Lastname.pdf


If you have any questions during the assignment, send them to
[email protected]

Good luck and we hope you learn something new in this process!
