
Assessment for Data Engineering Internship at DataGrokr

Thank you for your interest in the Data Engineering Internship at DataGrokr.

We anticipate that selected candidates will work on Data Engineering and Cloud related
projects, so this assignment tests your skills in those areas. Candidates who are already
proficient in SQL, Python, and Spark will have an edge, but even if you know nothing
about these technologies, you should be able to complete the assignment by following
the instructions and studying the links provided.

Please note that the ability to learn new technologies and follow instructions is a
key skill in your day-to-day job at DataGrokr.

What you need to do:

The objective of the assignment is to test your proficiency in querying and data
analysis. The assignment has 3 parts.

● Section 1: You will provision an environment in Databricks (a provider of
managed Spark as a Service on the Cloud) and load some data sets.

● Section 2: Analyze the given dataset and answer the questions using either
SparkSQL or PySpark on the Databricks cluster provisioned in Section 1.

● Section 3: Create a class to expose the result sets generated in Section 2 to
data consumers.

Section 1: Environment setup and data loading.

1. Go to Databricks community cloud and create a free account to get access to a
single node Spark cluster:
a. Go to https://databricks.com/try-databricks and select the
“COMMUNITY EDITION” (click the “get started” button).
b. Provide all details and sign up.
c. Verify your email ID and set a password.
d. You can watch the linked video on YouTube to get started with Databricks
community cloud.
2. Load the IPL dataset into the cluster:
a. Download the files needed to complete this assignment from here. Load the
dataset into the cluster. This data set contains data on IPL matches.
b. We have sampled down the files and created a zipped file. You can
download the zipped file and extract the files.
c. Once you have downloaded the files, create a new notebook and
import the data into the cluster.
d. Create data frames for each of the data sets. Give proper column
names and datatypes (refer to the schema provided with the data).
e. Register those dataframes as tables.
f. Please use PySpark or SparkSQL (a loading sketch follows the notes
below).
g. If you are new to Spark, refer to the Databricks and Spark documentation
to learn about notebooks, dataframes, loading data, etc. Below are some
links that may be useful for learning Spark:
i. https://docs.databricks.com/getting-started/spark/index.html
ii. https://spark.apache.org/docs/latest/api/python/

Notes: Databricks notebooks support mixing languages. You can override the
default language by specifying the language magic command %<language> at
the beginning of a cell. The supported magic commands include %python, %sql,
%sh, and %md. Read https://docs.databricks.com/notebooks/index.html
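
Here is a minimal loading sketch for step 2. The file name matches.csv, the upload
path, and the column names are assumptions for illustration; match them to the actual
files and the schema sheet shipped with the zip.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# In a Databricks notebook, a SparkSession named `spark` is available by default.
matches_schema = StructType([
    StructField("match_id", IntegerType(), True),
    StructField("venue", StringType(), True),
    StructField("city", StringType(), True),
    # ... remaining columns per the provided schema sheet
])

matches_df = (spark.read
              .option("header", "true")
              .schema(matches_schema)
              .csv("/FileStore/tables/matches.csv"))  # assumed upload path

# Register the dataframe as a table so SparkSQL queries can reference it.
matches_df.createOrReplaceTempView("matches")

The same pattern applies to each file in the dataset.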

Section 2: Using SparkSQL or pySpark for data analysis.

1. Use SparkSQL or PySpark to analyze the data and answer the following questions.

A. Find the top 3 venues that hosted the most eliminator matches.
B. Return the most catches taken by a player in IPL history.
C. Write a query to return a report for the highest wicket-taker in matches
that were affected by the Duckworth-Lewis (D/L) method.
D. Write a query to return a report for the highest strike rate by a batsman in
non-powerplay overs (overs 7-20).
Note: strike rate = (total runs scored / total balls faced by the player) * 100. Make
sure that the balls faced are legal deliveries (not wide balls or no balls).
E. Write a query to return a report for the highest extra runs at a venue
(stadium, city).
F. Write a query to return a report for the cricketers with the most Player of the
Match awards at neutral venues.
G. Write a query to get a list of the top 10 players with the highest batting average.
Note: batting average is the total number of runs scored divided by the number of
times the player has been out (make sure to include run outs at the non-striker's
end as valid outs while calculating the average).
H. Write a query to find out who has officiated (as an umpire) the most
matches in the IPL.
I. Find the venue details of the match where V Kohli scored his highest individual
runs in the IPL.
J. Creative case study:
Please analyze how winning or losing the toss can impact a match and its result.
(Bonus for visualization here.)

2. Please refer to the schema reference Excel sheet provided in the zipped file to
understand the relationships between the tables. Please write the queries in the same
notebook. A sample sketch for question A follows.
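
As a sketch of the expected working style, here is how question A might be answered
from Python. The matches table and the venue and eliminator columns are assumptions
based on common IPL schemas; match them to the provided schema sheet.

# Question A sketch: top 3 venues by number of eliminator matches hosted.
top_venues = spark.sql("""
    SELECT venue, COUNT(*) AS eliminator_matches
    FROM matches
    WHERE eliminator = 'Y'
    GROUP BY venue
    ORDER BY eliminator_matches DESC
    LIMIT 3
""")
top_venues.show()

Equivalently, the same query can be written in a %sql cell or with PySpark dataframe
operations.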
Section 3: Expose Data.
1. Create a database (use any relational DB, preferably SQLite) and load the data from
the dataset (Section 1.1) into the DB.
2. Create a class Database (see the sketch after this list). The class will need to have:
1. A constructor to initialize the DB connection and other variables.
2. Methods implemented to return the result set for each query in Section 2.1;
each result set should be returned as a JSON/dict object.
3. Exception handling.
4. A get_status method to ping database connectivity.
3. Feel free to create any additional classes or data structures you deem necessary.
4. Evaluation: here is an input example:
from database import Database
db = Database()
qry1_result = db.get_query1_result()
5. Please follow industry standards while writing the code and include basic schema and
data validations. [Reference 1.2]
6. Preferred programming language:
● Python
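
A minimal sketch of such a class, assuming SQLite with a local file named ipl.db; the
file name, the placeholder SQL, and the _run helper are illustrative assumptions, not a
required design.

import sqlite3


class Database:
    """Sketch only: assumes a local SQLite file and placeholder queries."""

    def __init__(self, db_path="ipl.db"):
        # Initialize the DB connection; rows come back as dict-like objects.
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row

    def get_status(self):
        # Ping database connectivity with a trivial query.
        try:
            self.conn.execute("SELECT 1")
            return {"status": "ok"}
        except sqlite3.Error as exc:
            return {"status": "error", "detail": str(exc)}

    def _run(self, sql):
        # Shared helper: execute a query and return rows as a list of dicts.
        try:
            cursor = self.conn.execute(sql)
            return [dict(row) for row in cursor.fetchall()]
        except sqlite3.Error as exc:
            raise RuntimeError(f"Query failed: {exc}") from exc

    def get_query1_result(self):
        # Placeholder SQL; substitute your Section 2 answer for question A.
        return self._run(
            "SELECT venue, COUNT(*) AS eliminator_matches FROM matches "
            "WHERE eliminator = 'Y' GROUP BY venue "
            "ORDER BY eliminator_matches DESC LIMIT 3"
        )

One method per Section 2 question would follow the same pattern as get_query1_result.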

Deliverables:

a. For Section 1, Section 2 and Section 3: a single Databricks notebook
where you have developed the code for Section 1, Section 2 and
Section 3. Download the files and email them to us. We will run the
Databricks notebook on our end and evaluate your submission.


b. Your code will be evaluated not just on the final answers but also on code
quality and unit tests:

• Follow coding standards (PEP-8)

• Appropriate error/exception handling

• Modular function design

c. Your final submission should be sent to [email protected]. Your
submission is due by end of day <01/02/2022>, and the subject should
follow this pattern: <Source, e.g. Internshala, college name> Data Engineer:
<Your Full Name>

d. Your up-to-date resume, named Firstname_Lastname.pdf


If you have any questions during the assignment, send them to
[email protected]

Good luck and we hope you learn something new in this process!
