Turing Data Engineering Challenge
Turing
1900 Embarcadero Road #104,
Palo Alto, California 94303, U.S.A
Data Engineering Challenge
Overview
The purpose of this challenge is to test your ability to obtain large amounts of data from web
sources and perform processing in a distributed manner under the given constraints.
In this challenge, you will download 100,000 public GitHub repositories and perform some
processing on the downloaded code. Using Amazon Web Services or Google Cloud Platform
with multiple instances to perform the processing is highly recommended. The AWS free tier
provides about 750 hours/month of t2.micro usage, which should be sufficient for this challenge.
Let us know if you face any problems while setting things up.
Instructions
Obtaining the data
A list of 100,000 repositories is provided here. Each repository is around 2 MB in size, and the
primary language is Python. You need to clone all the repositories (only the master branch),
find a fast and cost-efficient way of obtaining the data, and perform the processing described
below.
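For reference, here is a minimal sketch of one possible way to fetch a single repository; the
function name clone_repo is illustrative, and a shallow, single-branch clone is just one option
for keeping transfer time and disk usage low.

    import subprocess

    def clone_repo(url, dest_dir):
        # Shallow clone of only the master branch; repositories whose default
        # branch is not named "master" would need separate handling.
        subprocess.run(
            ["git", "clone", "--depth", "1", "--single-branch",
             "--branch", "master", url, dest_dir],
            check=True,
        )

Downloading the branch as an archive instead of running a full git clone is another option
worth benchmarking.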
Processing
For each repository, our goal is to compute certain statistics for the Python code only. Here is
the list of items that you need to compute for each repository:
1. The number of lines of code in the repository.
2. List of external libraries/packages used (an illustrative sketch covering items 2, 3, 5 and
6 appears after this list).
3. The nesting factor for the repository: the nesting factor is the average depth of a
nested for loop throughout the code. For example:

    # Loop 1
    for i in range(100):
        for elem in elements:
            ... do something ...
            for k in elem:
                ... do something ...

    # Loop 2
    for i in range(100):
        for elem in elements:
            ... do something ...
        for k in range(100):
            ... do something ...

Note: You must report the average nesting factor for the entire repository, not for
individual files.
4. Code duplication: the percentage of the code that is duplicated, per file. If the same 4
consecutive lines of code (disregarding blank lines, comments and other non-code items)
appear in multiple places in a file, all occurrences except the first occurrence are
considered duplicates (one possible approach is sketched after this list).
5. Average number of parameters per function definition in the repository.
6. Average number of variables defined per line of code in the repository.
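As referenced in item 2, the sketch below shows one possible way, not a reference solution, to
approximate items 2, 3, 5 and 6 for a single source file using Python's built-in ast module. The
function name analyse_source and the simplifications noted in the comments are assumptions.

    import ast

    def analyse_source(source):
        """Return (imports, for_loop_depths, params_per_def, assignment_count)."""
        tree = ast.parse(source)
        imports, depths, params = set(), [], []
        assignments = 0

        def visit(node, for_depth):
            nonlocal assignments
            if isinstance(node, (ast.For, ast.AsyncFor)):
                for_depth += 1
                depths.append(for_depth)          # record the depth of every for loop
            elif isinstance(node, ast.Import):
                imports.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imports.add(node.module.split(".")[0])
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Only plain positional parameters are counted; *args, keyword-only
                # arguments and **kwargs are ignored for simplicity.
                params.append(len(node.args.args))
            elif isinstance(node, ast.Assign):
                assignments += len(node.targets)  # crude proxy for "variables defined"
            for child in ast.iter_child_nodes(node):
                visit(child, for_depth)

        visit(tree, 0)
        return imports, depths, params, assignments

How the per-loop depths are averaged into a single nesting factor, and how external libraries
are separated from the standard library and from the repository's own modules, are left to
your interpretation.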
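Likewise, as referenced in item 4, one assumed (not prescribed) way to estimate per-file code
duplication is to hash every window of 4 consecutive code lines and count the lines covered
only by repeated windows:

    def duplication_percentage(lines, window=4):
        # Keep only code lines; a fuller solution would also strip docstrings,
        # trailing comments and other non-code items.
        code = [ln.strip() for ln in lines
                if ln.strip() and not ln.strip().startswith("#")]
        if len(code) < window:
            return 0.0
        first_seen, duplicated = {}, set()
        for i in range(len(code) - window + 1):
            key = tuple(code[i:i + window])
            if key in first_seen:
                duplicated.update(range(i, i + window))  # later occurrences are duplicates
            else:
                first_seen[key] = i
        return 100.0 * len(duplicated) / len(code)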
Deliverables
1. An output file 'results.json' with the results of the computation in the following JSON
format. Each item in the array represents the result for a single repository.

    [
      {
        "repository_url": "https://github.com/tensorflow/tensorflow",
        "number of lines": 59234,
        "libraries": ["tensorflow", "numpy", ...],
        "nesting factor": 1.457845,
        "code duplication": 23.78955,
        "average parameters": 3.456367,
        "average variables": 0.03674
      },
      ...
    ]
2. The code, accompanied by a README containing instructions to run it. Please
mention the dependencies, external packages, etc. used to execute the code.
Grading Criteria
1. Use of distributed systems: We expect you to use multiple nano/micro instances to
distribute the workload.
2. Efficiency: Since these tasks require you to rummage through a lot of text data, you need
to make sure the algorithms and methods used to calculate the statistics are efficient in
both time and memory.
3. Accuracy: How accurate are the statistics? Are all the edge cases covered? We are not
looking for exact answers and will accept anything within 5% of the exact answer.
4. Optimisations: Given that we aren’t looking for 100% accuracy, can you trade off some
accuracy for a much faster method?
5. Comments: Well documented and commented code with docstrings, comments and/or
a readme.
Additional Questions
For any additional questions, please use the contact details listed in the email that
accompanied the challenge.