Software Engineer Intern Assignment

The document outlines an assignment to enhance the CDAP Wrangler library by adding support for parsing byte size and time duration units, for which it currently lacks built-in functionality. It details objectives such as modifying the grammar, updating Java code, implementing a new directive for aggregation, and creating comprehensive test cases. Additionally, it includes requirements for a web-based application for bidirectional data ingestion between ClickHouse and flat files, emphasizing user interface design, error handling, and optional features.


Assignment: Enhance Wrangler with Byte Size and Time Duration Units Parsers

1. Background

The CDAP Wrangler library currently parses various data types but lacks built-in
support for easily handling units like Kilobytes (KB), Megabytes (MB), milliseconds
(ms), or seconds (s). Users often need to perform calculations or conversions on
columns representing data sizes or time intervals, which requires complex multi-step
recipes.

This assignment aims to enhance the Wrangler core library by adding native support
for parsing and utilizing byte size and time duration units within recipes. This involves
modifying the grammar, updating the parsing logic, extending the API, and
implementing a new directive to demonstrate the usage.

2. Objectives
●​ Integrate new lexer tokens (BYTE_SIZE, TIME_DURATION) into the Wrangler
grammar.
●​ Update the relevant Java code (wrangler-api, wrangler-core) to handle these new
token types.
●​ Implement a new aggregate directive (aggregate-stats) that utilizes these new
types.
●​ Develop comprehensive test cases, including aggregation scenarios using the
new units.
3. Detailed Tasks
● Fork the repository https://github.com/data-integrations/wrangler to your own
GitHub account. Commit all changes to your fork, and add a section to
README.md describing the usage of these two new parsers.
●​ (a) Grammar Modification (wrangler-core/src/main/antlr4/.../Directives.g4):
○​ Add the lexer rules for BYTE_SIZE and TIME_DURATION (including helper
fragments BYTE_UNIT, TIME_UNIT).
○​ Modify relevant parser rules (e.g., value, or create new specific rules like
byteSizeArg, timeDurationArg) to accept BYTE_SIZE and TIME_DURATION
tokens where appropriate for directive arguments.
○​ Regenerate the ANTLR Java parser/lexer code using the appropriate build
process (e.g., mvn compile).
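As a rough illustration of the grammar change, lexer rules of the following shape could be added to Directives.g4. This is a minimal sketch: the DIGITS fragment and the exact unit lists are assumptions, so align them with the conventions already used in the grammar and make sure existing numeric/identifier rules do not swallow the new tokens.

  // Illustrative ANTLR rules; adapt names and unit lists to the grammar's existing style.
  BYTE_SIZE
      : DIGITS ('.' DIGITS)? BYTE_UNIT
      ;

  TIME_DURATION
      : DIGITS ('.' DIGITS)? TIME_UNIT
      ;

  fragment DIGITS    : [0-9]+ ;
  fragment BYTE_UNIT : ('B'|'KB'|'MB'|'GB'|'TB'|'PB'|'b'|'kb'|'mb'|'gb'|'tb'|'pb') ;
  fragment TIME_UNIT : ('ns'|'us'|'ms'|'s'|'m'|'h'|'d') ;

  // Example parser rules that reuse the new tokens as directive arguments:
  byteSizeArg     : BYTE_SIZE ;
  timeDurationArg : TIME_DURATION ;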
●​ (b) API Updates (wrangler-api module):
○​ Create new Java classes ByteSize.java and TimeDuration.java extending Token
[wrangler-api/src/main/java/io/cdap/wrangler/api/parser/Token.java]
■​ These classes should parse the token string (e.g., "10KB", "150ms") in
their constructor.
■​ Provide methods to retrieve the value in a canonical unit (e.g., long
getBytes() for ByteSize)
○ Add BYTE_SIZE and TIME_DURATION to the TokenType enumeration.
○​ Update usage definition and token definition to support specifying these new
token types as valid directive arguments.
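To make the API changes concrete, here is a minimal sketch of the parsing logic for ByteSize. The Token interface methods required by wrangler-api are omitted because their exact signatures should be taken from the repository, and the binary multiples (1 KB = 1024 bytes) are an assumption that must stay consistent with your tests. TimeDuration can follow the same pattern with a canonical accessor such as long getNanoseconds().

  // Sketch only: add "implements Token" plus its required methods per wrangler-api.
  public class ByteSize {
    private final long bytes;

    public ByteSize(String text) {
      String t = text.trim().toUpperCase();
      int i = 0;
      while (i < t.length() && (Character.isDigit(t.charAt(i)) || t.charAt(i) == '.')) {
        i++;
      }
      double number = Double.parseDouble(t.substring(0, i));
      String unit = t.substring(i);
      long multiplier;
      switch (unit) {
        case "B":  multiplier = 1L; break;
        case "KB": multiplier = 1024L; break;
        case "MB": multiplier = 1024L * 1024; break;
        case "GB": multiplier = 1024L * 1024 * 1024; break;
        case "TB": multiplier = 1024L * 1024 * 1024 * 1024; break;
        default: throw new IllegalArgumentException("Unknown byte unit: " + unit);
      }
      this.bytes = (long) (number * multiplier);
    }

    // Canonical value in bytes.
    public long getBytes() {
      return bytes;
    }
  }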
●​ (c) Core Parser Updates (wrangler-core module):
○ Modify the parse-tree visitor by adding visit methods for the parser rules
created/modified in step 3a (e.g., visitByteSizeArg, visitTimeDurationArg, or
modifying visitValue if applicable).
○ Hint: check ctx.getText().
○​ Add the created token instances to the TokenGroup.
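The core parser additions might then look roughly like this. RecipeSymbol.Builder, the builder field, and addToken are assumptions about how the existing visitor accumulates tokens into the TokenGroup, and the context class names come from the ANTLR code regenerated for the rules in step 3a; mirror the surrounding visit methods rather than copying this verbatim.

  // Hypothetical visit methods added to the ANTLR visitor in wrangler-core (e.g. RecipeVisitor).
  @Override
  public RecipeSymbol.Builder visitByteSizeArg(DirectivesParser.ByteSizeArgContext ctx) {
    // ctx.getText() is the raw matched text, e.g. "10KB" or "1.5MB".
    builder.addToken(new ByteSize(ctx.getText()));
    return builder;
  }

  @Override
  public RecipeSymbol.Builder visitTimeDurationArg(DirectivesParser.TimeDurationArgContext ctx) {
    builder.addToken(new TimeDuration(ctx.getText()));
    return builder;
  }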
●​ (d) New Directive Implementation (wrangler-core module):
○ Create a new directive class that performs the aggregation, implementing the
Directive interface.
○​ UsageDefinition (define()): The directive should accept at least four
arguments:
1.​ ColumnName (source column with byte sizes).
2.​ ColumnName (source column with time durations).
3.​ ColumnName (target column name for total size).
4.​ ColumnName (target column name for total or average time).​
Optionally, add arguments to specify output units (e.g., 'MB', 'GB',
'seconds', 'minutes') or aggregation type (total, average).
○​ Initialization : Store the source and target column names provided in the
arguments.
○​ Execution :
■ This directive should operate as an aggregate, so you need a store in which
to accumulate running totals across rows. Hint: check ExecutorContext.
■​ For each row:
■​ Read the byte size value from the source size column.
■​ Read the time duration value from the source time column.
■​ Add these values (converted to canonical units like bytes and
nanoseconds) to running totals stored in the Store.
■​ Finalization : This method (if using RecipePipeline's aggregation
capabilities) or logic within the last execute call needs to:
■​ Retrieve the final totals from the Store.
■​ Perform unit conversions if required by arguments (e.g., convert total
bytes to MB, total nanoseconds to seconds).
■​ Return a single new Row containing the target columns with the
calculated aggregate values (e.g., total_size_mb, total_time_sec).
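Putting the directive together, one possible shape is sketched below. The annotations, the argument API, and TimeDuration.getNanoseconds() follow the general pattern described above rather than verified signatures; imports are omitted, so mirror an existing directive in wrangler-core for the exact packages. The sketch keeps its running totals in instance fields, which is enough for a single test batch; for real pipelines, the store reachable through ExecutorContext is the more robust place for them.

  @Plugin(type = Directive.TYPE)
  @Name("aggregate-stats")
  @Description("Aggregates a byte-size column and a time-duration column into summary columns.")
  public class AggregateStats implements Directive {
    private String sizeCol, timeCol, sizeTargetCol, timeTargetCol;
    private long totalBytes;
    private long totalNanos;

    @Override
    public UsageDefinition define() {
      UsageDefinition.Builder builder = UsageDefinition.builder("aggregate-stats");
      builder.define("sizeCol", TokenType.COLUMN_NAME);        // source column with byte sizes
      builder.define("timeCol", TokenType.COLUMN_NAME);        // source column with time durations
      builder.define("sizeTargetCol", TokenType.COLUMN_NAME);  // target column for total size
      builder.define("timeTargetCol", TokenType.COLUMN_NAME);  // target column for total time
      return builder.build();
    }

    @Override
    public void initialize(Arguments args) {
      sizeCol = ((ColumnName) args.value("sizeCol")).value();
      timeCol = ((ColumnName) args.value("timeCol")).value();
      sizeTargetCol = ((ColumnName) args.value("sizeTargetCol")).value();
      timeTargetCol = ((ColumnName) args.value("timeTargetCol")).value();
    }

    @Override
    public List<Row> execute(List<Row> rows, ExecutorContext context) {
      // Accumulate in canonical units (bytes, nanoseconds). In a single-batch test all
      // rows arrive in one call; in a real pipeline keep these totals in the transient
      // store discussed above and emit the summary during finalization.
      for (Row row : rows) {
        totalBytes += new ByteSize(String.valueOf(row.getValue(sizeCol))).getBytes();
        totalNanos += new TimeDuration(String.valueOf(row.getValue(timeCol))).getNanoseconds();
      }
      // Emit a single summary row; unit conversion happens here.
      Row result = new Row();
      result.add(sizeTargetCol, totalBytes / (1024.0 * 1024.0));   // bytes -> MB
      result.add(timeTargetCol, totalNanos / 1_000_000_000.0);     // ns -> seconds
      return Collections.singletonList(result);
    }

    @Override
    public void destroy() {
      // Nothing to clean up.
    }
  }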
●​ (e) Testing (wrangler-core module):
○​ Add unit tests for the ByteSize and TimeDuration classes, verifying correct
parsing and canonical value retrieval for various inputs (e.g., "10kb", "1.5MB",
"5ms", "2.1s").
○ Add parser tests (e.g., in GrammarBasedParserTest.java or
RecipeCompilerTest.java) to ensure recipes using the new syntax are parsed
correctly and invalid syntax is rejected.
○​ Add comprehensive unit tests for the new AggregateStats directive. See the
Test Case Specification below.
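For the parser-class tests, something along these lines works (JUnit 4, matching the assertion style used in the next section; the expected values assume binary KB/MB multiples and the hypothetical getNanoseconds() accessor from the sketch above):

  import org.junit.Assert;
  import org.junit.Test;

  public class ByteSizeAndTimeDurationTest {

    @Test
    public void testByteSizeParsing() {
      Assert.assertEquals(10L * 1024, new ByteSize("10KB").getBytes());
      Assert.assertEquals((long) (1.5 * 1024 * 1024), new ByteSize("1.5MB").getBytes());
      Assert.assertEquals(10L * 1024, new ByteSize("10kb").getBytes());   // units are case-insensitive
    }

    @Test
    public void testTimeDurationParsing() {
      Assert.assertEquals(5_000_000L, new TimeDuration("5ms").getNanoseconds());
      Assert.assertEquals(2_100_000_000L, new TimeDuration("2.1s").getNanoseconds());
    }
  }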
4. Test Case Specification: Aggregation
● Input Data: Create a list of Row objects or parse a simple file representing
sample log or transaction data (e.g., you can use TestingRig).
● Recipe:
  String[] recipe = new String[] {
      // Example: aggregate size (output MB) and total time (output seconds)
      "aggregate-stats :data_transfer_size :response_time total_size_mb total_time_sec"
      // Add variations for average, median, p95, p99 and different output units if implemented
  };

●​ Execution: Use TestingRig.execute(recipe, rows) to run the recipe.


●​ Expected Output: Assert that the output contains a single row with the correctly
calculated aggregate values.
○​ Size Calculation: Sum all data_transfer_size values (converted to bytes) and
then convert the final sum to Megabytes (MB) for the total_size_mb column
(using 1 MB = 1024 * 1024 bytes or 1000 * 1000 bytes - be consistent!).
○​ Time Calculation: Sum all response_time values (converted to nanoseconds
or milliseconds) and then convert the final sum to seconds for the
total_time_sec column. (If implementing average, divide by the number of
rows before unit conversion).
○ Example assertion structure:
  // results = TestingRig.execute(recipe, rows);
  Assert.assertEquals(1, results.size());
  // Use a tolerance when comparing float/double values.
  Assert.assertEquals(expectedTotalSizeInMB, (double) results.get(0).getValue("total_size_mb"), 0.001);
  Assert.assertEquals(expectedTotalTimeInSeconds, (double) results.get(0).getValue("total_time_sec"), 0.001);
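Pulled together, a complete aggregation test might look roughly like the following. The Row constructor and chained add(...) call mirror typical wrangler-core tests but should be checked against the actual Row API, and the expected values assume 1 MB = 1024 * 1024 bytes.

  @Test
  public void testAggregateStats() throws Exception {
    String[] recipe = new String[] {
      "aggregate-stats :data_transfer_size :response_time total_size_mb total_time_sec"
    };

    List<Row> rows = Arrays.asList(
      new Row("data_transfer_size", "1MB").add("response_time", "500ms"),
      new Row("data_transfer_size", "512KB").add("response_time", "1.5s")
    );

    List<Row> results = TestingRig.execute(recipe, rows);

    Assert.assertEquals(1, results.size());
    // 1MB + 512KB = 1.5 MB; 500ms + 1.5s = 2.0 seconds
    Assert.assertEquals(1.5, (double) results.get(0).getValue("total_size_mb"), 0.001);
    Assert.assertEquals(2.0, (double) results.get(0).getValue("total_time_sec"), 0.001);
  }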

5. AI Tools Usage:
● We strongly encourage using AI coding assistance with any tool of your
choice.
● Record the prompts you use in these tools.
6. Deliverables
● The assignment will only be evaluated if it is committed to GitHub.
●​ Modified Directives.g4 file.
●​ All new and modified Java source files (.java) within wrangler-api and
wrangler-core modules.
●​ All new and modified unit test files (.java) within wrangler-core.
●​ Evidence of successful build and test execution.
● If AI tooling is used, submit the set of prompts you recorded or check it into
GitHub as a prompts.txt file.

7. Evaluation Criteria (Example)


●​ Correctness: Do the new tokens parse correctly? Does the directive compute
aggregates accurately? Do tests pass?
●​ Code Quality: Is the code well-formatted, commented, and easy to understand?
Are existing patterns followed?
●​ Robustness: Does the code handle edge cases (e.g., zero values, large numbers,
different unit cases)?
●​ Test Coverage: Are the new features adequately tested?

Integration Assignment: Bidirectional ClickHouse & Flat File Data Ingestion Tool
1. Objective:

Develop a web-based application with a simple user interface (UI) that facilitates data
ingestion between a ClickHouse database and the Flat File platform. The application
must support bidirectional data flow (ClickHouse to Flat File and Flat File to
ClickHouse), handle JWT token-based authentication for ClickHouse as a source,
allow users to select specific columns for ingestion, and report the total number of
records processed upon completion.

2. Core Requirements:
●​ Application Type: Web application (backend logic + frontend UI).
●​ Bidirectional Flow: Implement both:
○​ ClickHouse -> Flat File ingestion.
○ Flat File -> ClickHouse ingestion.
●​ Source Selection: UI must allow users to choose the data source ("ClickHouse"
or "Flat File").
●​ ClickHouse Connection (as Source):
○​ UI Configuration: Inputs for Host, Port (e.g., 9440/8443 for https, 9000/8123
for http), Database, User, and JWT Token.
○ Authentication: Use the provided JWT token via a compatible ClickHouse
client library (a connection sketch follows this list).
○​ Client Library: Use a client from the official list:
https://github.com/ClickHouse (Use any language of your choice - Golang,
Python, Java).
●​ Flat File Integration:
○​ UI Configuration : Local Flat File name, Delimiters
○​ Client Library - Use any IO library.
●​ Schema Discovery & Column Selection:
○​ Connect to the source and fetch the list of available tables (ClickHouse) or the
schema of the Flat File data.
○​ Display column names in the UI with selection controls (e.g., checkboxes).
●​ Ingestion Process:
○​ Execute data transfer based on user selections.
○​ Implement efficient data handling (batching/streaming recommended).
●​ Completion Reporting: Display the total count of ingested records upon
success.
●​ Error Handling: Implement basic error handling (connection, auth, query,
ingestion) and display user-friendly messages.
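One possible shape for the ClickHouse-to-Flat-File path is sketched below using the official JDBC driver. The JDBC URL format, the ssl property, and passing the JWT as the password are assumptions to verify against the client library and deployment you actually use, and the CSV writing is deliberately naive (no quoting or escaping).

  import java.io.FileWriter;
  import java.io.PrintWriter;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.Properties;

  public class ClickHouseToCsv {
    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.setProperty("user", "default");
      // Assumption: the JWT is supplied as the password; check how your chosen
      // client expects token-based authentication to be passed.
      props.setProperty("password", System.getenv("CLICKHOUSE_JWT"));
      props.setProperty("ssl", "true");

      String url = "jdbc:clickhouse://<host>:8443/default";     // host/port/database from the UI
      String query = "SELECT town, price FROM uk_price_paid";   // only the user-selected columns

      try (Connection conn = DriverManager.getConnection(url, props);
           Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery(query);
           PrintWriter out = new PrintWriter(new FileWriter("export.csv"))) {
        out.println("town,price");
        long count = 0;
        while (rs.next()) {
          // Stream row by row; tune fetch size / batching for large tables.
          out.println(rs.getString("town") + "," + rs.getLong("price"));
          count++;
        }
        System.out.println("Ingested records: " + count);       // completion report for the UI
      }
    }
  }

Schema discovery can reuse the same connection with queries such as SHOW TABLES and DESCRIBE TABLE <name>, and the Flat File -> ClickHouse direction is the mirror image: read the CSV, then insert in batches with a PreparedStatement.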
3. User Interface (UI) Requirements:
●​ Clear source/target selection.
●​ Input fields for all necessary connection parameters (ClickHouse source/target,
Flat File).
●​ Mechanism to list tables (ClickHouse) or identify Flat File data source.
●​ Column list display with selection controls.
●​ Action buttons (e.g., "Connect", "Load Columns", "Preview", "Start Ingestion").
●​ Status display area (Connecting, Fetching, Ingesting, Completed, Error).
●​ Result display area (record count or error message).
4. Bonus Requirements:
●​ Multi-Table Join (ClickHouse Source):
○​ Allow selection of multiple ClickHouse tables.
○​ UI element to input JOIN key(s)/conditions.
○ Backend logic to construct and execute the JOIN query for ingestion (see the
sketch below).
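For the join itself, the backend only needs to assemble a query from the user's selections and feed it through the same export path. A trivial sketch follows; the table, column, and key names are whatever the UI supplies and should be validated against the discovered schema before being placed into SQL.

  // Builds e.g. "SELECT t1.colA, t2.colB FROM t1 JOIN t2 ON t1.key = t2.key"
  static String buildJoinQuery(String leftTable, String rightTable, String joinKey,
                               java.util.List<String> selectedColumns) {
    return "SELECT " + String.join(", ", selectedColumns)
        + " FROM " + leftTable
        + " JOIN " + rightTable
        + " ON " + leftTable + "." + joinKey + " = " + rightTable + "." + joinKey;
  }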

5. Optional Features (Enhancements):


●​ Progress Bar: Visual indicator of ingestion progress (can be approximate).
●​ Data Preview: Button to display the first 100 records of the selected source data
(with selected columns) in the UI before full ingestion.
6. Technical Considerations:
● Backend: Go or Java preferred, but any language is accepted.
●​ Frontend: Simple HTML/CSS/JS, React, Vue, Angular, or server-side templates.
●​ ClickHouse Instance: Local (Docker) or cloud-based. Load example datasets for
testing.
● JWT Handling: Use libraries to manage JWTs if needed; the main task is simply to
pass the token to the ClickHouse client.
●​ Data Type Mapping: Consider potential type mismatches between ClickHouse
and Flat File/CSV.
7. Testing Requirements:
●​ Datasets: Use ClickHouse example datasets like uk_price_paid and ontime
(https://clickhouse.com/docs/getting-started/example-datasets).
●​ Test Cases:
1.​ Single ClickHouse table -> Flat File (selected columns). Verify count.
2.​ Flat File (CSV upload) -> New ClickHouse table (selected columns). Verify
count & data.
3.​ (Bonus) Joined ClickHouse tables -> Flat File. Verify count.
4.​ Test connection/authentication failures.
5.​ (Optional) Test data preview.

8. AI Tools Usage:
● We strongly encourage using AI coding assistance with any tool of your
choice.
● Record the prompts you use in these tools.
9. Deliverables:
● Source code: checking it into a Git repository is a must-have; otherwise the
submission will not be evaluated.
●​ README.md with setup, configuration, and run instructions.
●​ (Optional) Short demo video.
● If AI tooling is used, submit the set of prompts you recorded or check it into
GitHub as a prompts.txt file.
