DataGrokr Technical Assignment - Data Engineering - Internshala
Thank you for your interest in the Data Engineering Internship at DataGrokr.
We anticipate that selected candidates will work on Data Engineering and Cloud-related projects.
As such, this assignment is designed to test candidates' skills in those areas. Candidates who are
already proficient in SQL, Python, and Spark will have an edge, but even if you don't know anything
about these technologies you should be able to complete the assignment by following the
instructions and studying the links provided.
Please note that this ability to learn new technologies and follow instructions is a key skill
required in your day-to-day job at DataGrokr.
Spark is written in the Scala programming language and requires the Java Virtual
Machine (JVM) to run. Therefore, our first task is to download Java.
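In a Colab notebook this can be done with a shell cell along the lines of the minimal sketch below
(assuming Colab's Debian-based runtime where apt-get is available; any JDK 8 or 11 works with Spark 3.1.x):
!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null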
Next, we will download and unzip Apache Spark with Hadoop 2.7 to install it.
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
Then we need to install and import the 'findspark' library, which locates the Spark installation on
the system so that PySpark can be imported like a regular library.
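A sketch of that setup cell is shown below; the JAVA_HOME and SPARK_HOME values are assumptions based on
the OpenJDK package installed above and the Spark archive we just unpacked, so adjust them if your paths differ.
!pip install -q findspark

import os
# Tell findspark where the JDK and the unpacked Spark distribution live (assumed paths).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

import findspark
findspark.init()  # makes pyspark importable by adding it to sys.path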
Now, import SparkSession from pyspark.sql and create a SparkSession, which is the
entry point to Spark.
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("<app_name>")
         .getOrCreate())
2. Download the files needed to complete this assignment from here and upload them to the
content directory of your Colab notebook.
3. Create dataframes for each of the datasets. Give proper column names and datatypes (refer to
the schema provided with the data). Check out spark.read.format from here to create a Spark
dataframe; a sketch of reading one file with an explicit schema is shown after this list.
4. Your SparkSession has an attribute called catalog which lists all the data inside the cluster.
This attribute has a few methods for extracting different pieces of information.
One of the most useful is the .listTables() method, which returns the names of all the tables in
your cluster as a list.
Register those dataframes as tables using .createOrReplaceTempView(). This method registers
the DataFrame as a table in the catalog, but because the table is temporary, it can only be accessed
from the specific SparkSession used to create the Spark DataFrame. To read about it, refer to
this link.
5. You can use PySpark dataframe functions like .select(), or you can pass a SQL query to spark.sql()
to solve the problem. Check out this link for more info; see the second sketch after this list, which
registers a temp view and queries it both ways.
6. Check out this PySpark documentation if you are new to it.
7. Check out the diagram to see all the different ways your Spark data structures interact with
each other.
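As referenced in step 3, below is a minimal sketch of creating a dataframe with explicit column names
and datatypes. The file name matches.csv and the fields in the schema are placeholders for illustration,
not the actual schema shipped with the data.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema; replace the fields with the schema provided with the data.
matches_schema = StructType([
    StructField("match_id", IntegerType(), True),
    StructField("city", StringType(), True),
    StructField("venue", StringType(), True),
    StructField("winner", StringType(), True),
])

matches_df = (spark.read.format("csv")
              .option("header", "true")
              .schema(matches_schema)
              .load("/content/matches.csv"))

matches_df.printSchema()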
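As referenced in steps 4 and 5, the sketch below registers that dataframe as a temporary view, lists the
tables in the catalog, and then queries the data once with dataframe functions and once with spark.sql().
Table and column names follow the hypothetical schema above.
# Register the dataframe as a temporary table in the catalog.
matches_df.createOrReplaceTempView("matches")

# The temp view now shows up in the catalog of this SparkSession.
print(spark.catalog.listTables())

# Option 1: dataframe functions
matches_df.select("venue", "winner").show(5)

# Option 2: SQL against the registered temp view
spark.sql("SELECT venue, COUNT(*) AS matches_played FROM matches GROUP BY venue").show(5)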
3. Write a query to return a report of the highest wicket-taker in matches that were affected by
the Duckworth-Lewis (D/L) method.
4. Write a query to return a report of the highest strike rate by a batsman in non-powerplay
overs (overs 7-20).
Note: strike rate = (total runs scored / total balls faced by the player) * 100. Make sure that the
balls faced are legal deliveries (not wide balls or no balls).
5. Write a query to return a report of the highest extra runs at a venue (stadium, city).
6. Write a query to return a report of the cricketers with the most Player of the Match awards at
neutral venues.
7. Write a query to get a list of the top 10 players with the highest batting average.
Note: batting average is the total number of runs scored divided by the number of times the player
has been out (make sure to count run outs at the non-striker's end as valid dismissals while
calculating the average).
8. Write a query to find out who has officiated (as an umpire) the most matches in the IPL.
9. Find venue details of the match where V Kohli scored his highest individual runs in IPL.
3. Feel free to create any additional classes or data structures you deem necessary.
5. Please follow industry standards while writing the code and include basic schema and data
validations; a small example of such a check is sketched below.
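As an illustration only, a basic schema/data validation could look like the sketch below. It assumes
the data is held in Spark dataframes as in Section 1; the expected column list and the key column are
placeholders, not part of the assignment data.
def validate_dataframe(df, expected_columns, key_column):
    """Basic checks: all expected columns exist and the key column has no nulls."""
    missing = set(expected_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    null_keys = df.filter(df[key_column].isNull()).count()
    if null_keys:
        raise ValueError(f"{null_keys} rows have a null {key_column}")
    return True

# Example usage with the hypothetical matches dataframe from Section 1:
# validate_dataframe(matches_df, ["match_id", "city", "venue", "winner"], "match_id")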
Deliverables:
1. For Section 1, Section 2 and Section 3: a single Colab notebook where you have developed
the code for all three sections. Download the Colab notebook and email it to us. We will run
the notebook on our end and evaluate your submission.
2. Your code will be evaluated not just on the final answers but also on code quality and unit tests.
3. The deliverables are due to us by end of day 7th April 2022, and the email subject should follow
this pattern:
<Internshala> Data Engineering Internship: Full Name
4. Please include your up-to-date resume, named as Firstname_Lastname.pdf
If you have any questions during the assignment, send your questions to
[email protected]
Good luck and we hope you learn something new in this process!