Task
Applicants must choose any one theme and complete the task listed under that theme. The
task consists of two parts - a programming task and a paper reading task. You must attempt
both tasks.
Some tasks include bonus tasks, which let you demonstrate that you can go the extra mile (an important characteristic of being in our group)! We encourage you to try those bonus tasks.
You must attempt this task on your own. Please make sure you cite any external resources that
you may use. We will also check your submission for plagiarism, including ChatGPT :)
Submission:
- Put all your code/notebooks in a GitHub repository. Maintain a README.md explaining
your codebase, the directory structure, commands to run your project, the dependency
libraries used and the approach you followed.
- The link to the GitHub repository will be asked for during the interview.
- Presentation / Report:
- Programming Task: Document your process and compile a presentation/report summarizing your methodology, findings, results, and any insights gained from the analysis. On the first page, it must detail exactly which parts of the task you did, and which parts you did not do and why.
- Paper Reading: For the paper reading task, we expect you to create a video that summarizes your analysis of the paper; the video should be less than 5 minutes. You can use PowerPoint/Loom to quickly record the video over your slides. Keep the video/slides simple, and emphasize the technical content rather than production quality. We are keen to gauge your understanding of foundational concepts. Your analysis could include one or more of the following: your thoughts on a summary of the paper, its major strengths and weaknesses, the generalisability of its techniques, limitations, extended research directions based on the paper, methodological insights, et cetera. To get you started, you can answer the following questions (only to get you started; you are free to improvise, BE CREATIVE!)
Can you break the CAPTCHA?
Classical OCR models rely on handcrafted features like edge maps and stroke patterns. While effective in some settings, these approaches tend to fall short when faced with variations in fonts, noisy backgrounds, or unconventional capitalizations. The most recent innovations in OCR that handle these variations are neural. In this task, you will train a neural network to extract text from basic images, essentially training a model to break CAPTCHAs.
Fun Fact: A certain project conducted in Precog used OCR models on a site to bypass their
CAPTCHA system and scrape data with a bot.
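Since you will be generating the dataset yourself, here is a minimal sketch of how one might render word images with Pillow. The font path and the noise recipe for the "hard" images are assumptions; tune them to match whatever easy/hard split you design.

import random
import string
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def render_word(word, size=(160, 60), hard=False):
    # Render a single word onto a white canvas.
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # The font path is an assumption; point it at any .ttf on your system.
    font = ImageFont.truetype("DejaVuSans.ttf", 32)
    draw.text((10, 12), word, fill="black", font=font)
    if hard:
        # A crude "hard" variant: speckle noise plus a slight blur.
        for _ in range(300):
            x, y = random.randrange(size[0]), random.randrange(size[1])
            draw.point((x, y), fill=tuple(random.randrange(256) for _ in range(3)))
        img = img.filter(ImageFilter.GaussianBlur(0.6))
    return img

word = "".join(random.choices(string.ascii_lowercase, k=6))
render_word(word, hard=True).save(f"{word}.png")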
Task 1 - Classification
Select a subset of your generated dataset containing only 100 words from both the hard and
easy sets. Then, train a neural classifier to classify images into one of these 100 labels.
Experiment with the number of samples required to obtain reasonable accuracy and report a
thorough scientific evaluation of your model. Mention any challenges you faced in trying to train
this model and explain how you overcame them.
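To make the starting point concrete, here is a minimal PyTorch sketch of a 100-way CNN classifier. The 60x160 grayscale input size is an assumption carried over from the rendering sketch above; treat the architecture as a baseline to iterate on, not a prescription.

import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global pooling keeps it input-size tolerant
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                      # x: (B, 1, H, W) grayscale images
        return self.head(self.features(x).flatten(1))

model = WordClassifier()
logits = model(torch.randn(8, 1, 60, 160))     # (8, 100)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 100, (8,)))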
Task 2 - Generation
In the real world, CAPTCHAs are unpredictable and do not belong to 100 easy classes. In this
subtask, you will improve upon your architecture to extract the text itself present in the image
i.e. we input the image, and the output is the text embedded in the image. Keep in mind that
words can be of variable length, which you will need to account for. Be scientific in your
evaluation. This task will require moderate tinkering with architectures and hyperparameters to
achieve reasonable performance — document everything you’ve done. We do not expect you to
solve this task entirely but we do expect meaningful forward progress.
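One common way to handle variable-length outputs is a convolutional encoder followed by a recurrent layer trained with CTC loss (the classic CRNN recipe); attention-based seq2seq decoders are a reasonable alternative. A minimal sketch, assuming lowercase-only words and 60x160 inputs:

import torch
import torch.nn as nn

class CRNN(nn.Module):
    # Conv feature extractor -> BiLSTM -> per-column logits, trained with CTC.
    def __init__(self, num_chars=26):          # class 0 is the CTC blank, 1..26 are a..z
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(128 * 15, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_chars + 1)

    def forward(self, x):                      # x: (B, 1, 60, 160)
        f = self.conv(x)                       # (B, 128, 15, 80)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (B, 80, 1920): one step per image column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)    # (B, 80, 27)

model, ctc = CRNN(), nn.CTCLoss(blank=0)
log_probs = model(torch.randn(4, 1, 60, 160)).permute(1, 0, 2)   # CTC wants (T, B, C)
targets = torch.randint(1, 27, (4, 6))                           # dummy 6-letter labels
loss = ctc(log_probs, targets,
           torch.full((4,), log_probs.size(0), dtype=torch.long),
           torch.full((4,), 6, dtype=torch.long))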
Task 3 - Bonus
In this task, you will work exclusively on the bonus set. Generation becomes harder in this setting
since the model now needs to learn how to output both in the forward direction and in reverse.
This makes training slightly more challenging, but we’re sure you can figure it out!
Pointers
● We strongly prefer that you do this task in PyTorch. This is not just our preference but
also the research standard.
● Please experiment and save all experiments you conduct. If you make any interesting or
surprising inferences, make sure to mention them.
● You may use Colab or Kaggle for access to GPUs; free-tier compute is sufficient for this
task.
● All the solutions are expected to be a neural network trained from scratch. No OCR
libraries.
The goal of Part 1 is to explore methods for generating dense word embeddings from a corpus
and assess their quality through various evaluations. These embeddings capture semantic
meaning of words in a continuous vector space, allowing for improved NLP applications like
similarity measurements, clustering, and analogy tasks.
1. Take any pre-trained monolingual word embeddings for English and Hindi.
2. Align the embeddings of English and Hindi by learning a transformation. One such technique is Procrustes analysis. This is just to give you a direction; you are encouraged to experiment with other alignment techniques.
3. Now you have cross-lingually aligned embeddings. How can you evaluate whether the alignment was effective? Please come up with an effective strategy to quantitatively answer this question.
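As a concrete sketch of steps 2 and 3: with row-aligned seed-dictionary matrices X (English) and Y (Hindi), orthogonal Procrustes has a closed-form SVD solution, and precision@1 on a held-out dictionary (bilingual lexicon induction) is one standard quantitative check. Loading the vectors and dictionaries is left out here; CSLS retrieval is a popular refinement of the plain nearest-neighbour search used below.

import numpy as np

def procrustes(X, Y):
    # X, Y: (n, d) embeddings of seed translation pairs, row i aligned with row i.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt                              # orthogonal W minimizing ||XW - Y||_F

def precision_at_1(X_test, Y_all, gold_idx, W):
    # For each held-out English word, map it and retrieve the nearest Hindi
    # vector by cosine similarity; score how often it is the gold translation.
    mapped = X_test @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    Yn = Y_all / np.linalg.norm(Y_all, axis=1, keepdims=True)
    nearest = (mapped @ Yn.T).argmax(axis=1)
    return float((nearest == gold_idx).mean())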
Paper Reading Task: Reasoning or Reciting? Exploring the Capabilities and Limitations of
Language Models Through Counterfactual Tasks
3. A-T-L-A-S Atlas!
Graphs are all around us, and nearly everything can be represented as a graph. Some common
examples are social networks, traffic flows, molecules, etc. But something more fun that can also be represented as a graph is games: specifically, the game of Atlas. For those unfamiliar, Atlas is a word game where players take turns naming places (e.g.,
cities, countries, or states) that start with the last letter of the previous place. For example, if one
player says "India," the next player must say a place starting with "A," like "Australia." Repeating
a place or failing to come up with a valid name results in elimination or loss of points. The game
continues until one player is left with no possible places to say out loud.
For simplicity, let us assume there are only two players in the game. Then, Atlas can be
represented as a directed graph, where there exists an edge from A to B if you can say B after
your opponent says A. This leads you to a graph like this. What do you think the color of the nodes signifies?
If we zoom into a country, it makes it clear what its neighbours represent. See the example of
Yemen above. It has incoming edges from countries ending with a “y”, and outgoing edges to
countries starting with an “n”.
In graph terms, the end condition for the game is saying a place whose node has no valid, non-repeated, outgoing edges to other nodes, thus trapping your opponent.
Dataset Creation
For this task, you are expected to collect and create your own dataset. A large and important
part of research is the quality of data you use to confirm hypotheses. You will be trying to solve
the game of Atlas on 3 graphs, with increasing difficulty.
1. Country Only: Atlas graph created only with the names of all officially recognised
countries (what “officially recognised” means is up to you, but please do cite your
source (: ).
2. City Only: Atlas graph created with the 500 most densely populated cities in the world.
3. Country+City: Atlas graph with data combined from 1 and 2.
Use the Networkx library to create the graph from your collected data. Ensure that the graph is
directed, and follows the rule “there exists an edge from A to B if you can say B after your
opponent says A”. Visualise the graphs using the library’s plotting functions (make them pretty; feel free to use GPT).
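A minimal construction sketch with NetworkX is below; how you normalise names (case, whitespace, multi-word places) is up to you, and the handling here is only an assumption. Note that nodes with no outgoing edges are exactly the immediate "trap" moves discussed above.

import networkx as nx
import matplotlib.pyplot as plt

def build_atlas_graph(places):
    # Directed edge A -> B iff B can be said after A, i.e. B starts with the
    # last letter of A (lowercasing both; normalisation is an assumption).
    G = nx.DiGraph()
    G.add_nodes_from(places)
    for a in places:
        for b in places:
            if a != b and a.lower()[-1] == b.lower()[0]:
                G.add_edge(a, b)
    return G

G = build_atlas_graph(["India", "Australia", "Austria", "Yemen", "Norway"])
traps = [n for n in G if G.out_degree(n) == 0]   # saying these leaves no reply
nx.draw_networkx(G, pos=nx.circular_layout(G), node_size=900, font_size=8)
plt.show()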
Resources:
- Networkx Python Library
- https://networkx.org/documentation/stable/reference/algorithms/community.html
- https://pytorch-geometric.readthedocs.io/en/latest/index.html
- https://www.youtube.com/playlist?list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn
- https://en.wikipedia.org/wiki/Graph_property
- https://en.wikipedia.org/wiki/Centrality
- https://www.youtube.com/watch?v=JheGL6uSF-4&t=117s
- https://pytorch-geometric.readthedocs.io/en/2.6.1/tutorial/shallow_node_embeddings.html?highlight=unsupervised
- 9. Link Prediction on MovieLens.ipynb
4. s/Math + AI/Reasoning in LLMs/
“A smooth sea never made a skillful sailor” ~ Franklin D. Roosevelt
The Unix tool sed is a simple yet powerful tool that can be used for pattern matching and text
replacement. Amusingly, someone created a puzzle site where you must arrive at the blank string by repeatedly replacing text according to given patterns. We highly recommend you try some puzzles to get a feel for the problem before continuing.
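The core move in these puzzles is a single first-occurrence replacement (what sed's s command does to a line). Mimicked in Python on the running example used below:

s = "HELLOWORLD"
s = s.replace("HELLO", "", 1)   # what `sed 's/HELLO//'` would do -> "WORLD"
s = s.replace("WORLD", "", 1)   # -> "", the blank-string goal state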
To understand whether LLMs can solve puzzles like the sed-puzzle, we need to collect a large enough set of them so that we can reduce, if not eliminate, any statistical biases that exist in the data.
For this task, you need to create a dataset that is representative of the problems available on the
sed-puzzle website. The puzzles in your dataset should have a variety of difficulty levels.
Determining the difficulty is left to you ;)
Note: Do not scrape the internet for puzzles; we want you to write your own generator for them. This also means no creating puzzles by hand, except for a tiny fraction. Also, it is very likely that the existing puzzles have been scraped by companies training foundational models, so scraping them may not provide an accurate evaluation of the abilities of the language model.
A few pointers to keep in mind for dataset generation:
● It should contain an initial state (a non-empty string) and a non-empty list of possible transitions. The final state is always the blank string.
Example: [Hello world, sed puzzle]
Input: HELLOWORLD
Available transitions: HELLO -> "", WORLD -> "" ("" corresponds to the empty string)
● Each datapoint in the dataset should be a valid puzzle (i.e. it should have a solution, even if a very long one). Invalid puzzles will be penalised. [Hint: how can you create a solution for a given puzzle? One reverse-construction idea is sketched after this list.]
● The data points in the dataset should be saved as a JSON file. We have provided some
starter code to do the same, but you are free to use that and/or modify it to suit your
needs.
For the above example, the corresponding JSON file will be
{
  "problem_id": "000",
  "initial_string": "HELLOWORLD",
  "transitions": [
    {
      "src": "HELLO",
      "tgt": ""
    },
    {
      "src": "WORLD",
      "tgt": ""
    }
  ]
}
● Keep the data points in the dataset limited to 100. You can generate more points if you
wish, but we would like you to pick the 100 most representative data points to form your
dataset.
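To make the hint above concrete, here is one possible generator sketch: build the puzzle backwards from the empty string, so a solution exists by construction. It only emits deletion rules for simplicity, and the validator assumes first-occurrence replacement semantics; check both assumptions against the starter code.

import random
import string

def rand_token():
    return "".join(random.choices(string.ascii_uppercase, k=random.randint(2, 4)))

def make_puzzle(problem_id, num_steps=4):
    # Grow the initial string backwards from the final (empty) state: each
    # reverse step inserts src somewhere, un-applying the deletion src -> "".
    s, transitions, order = "", [], []
    for _ in range(num_steps):
        src = rand_token()
        pos = random.randrange(len(s) + 1)
        s = s[:pos] + src + s[pos:]
        transitions.append({"src": src, "tgt": ""})
        order.append(len(transitions) - 1)
    puzzle = {"problem_id": problem_id, "initial_string": s, "transitions": transitions}
    return puzzle, {"problem_id": problem_id, "solution": order[::-1]}

def is_valid(puzzle, solution):
    # Replay the solution, assuming each rule rewrites the first occurrence.
    s = puzzle["initial_string"]
    for i in solution["solution"]:
        t = puzzle["transitions"][i]
        if t["src"] not in s:
            return False
        s = s.replace(t["src"], t["tgt"], 1)
    return s == ""

p, sol = make_puzzle("000")
assert is_valid(p, sol)   # random tokens can collide; regenerate if this fails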
Your dataset will be judged on both its quality and its quantity. It should contain enough examples that are challenging for simple uses of LLMs; otherwise, it may give a false portrayal of LLM capabilities.
To help you read and write puzzle files, we have provided some starter code, which can be found
at https://github.com/precog-iiith/sed-solver.
Task 2: Getting LLMs to solve your puzzles
“A bad workman always blames his tools” ~ Apocryphal saying
LLMs are often bad at reasoning without additional effort. However, over the past few years, there has been substantial research into getting them to reason logically about problems.
Your task is to evaluate various prompting methods on your puzzles. You need to try at least the following techniques and benchmark them across various samples from your dataset (how you implement them is up to you; a minimal sketch of the template shapes follows the list):
1. Zero-shot prompting
2. Few-shot prompting
3. Chain of Thought [CoT] prompting
4. Be creative :)
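A minimal sketch of the three required template shapes; the model call itself is omitted since the choice of LLM and API is up to you, and the exact wording is only a starting point.

import json

def zero_shot(puzzle):
    return ("Solve this sed puzzle: reach the empty string by applying the "
            "transitions.\n" + json.dumps(puzzle) +
            "\nAnswer with a JSON list of transition indices.")

def few_shot(puzzle, solved_examples):
    # solved_examples: a few (puzzle, solution) pairs drawn from your dataset.
    shots = "\n\n".join("Puzzle: " + json.dumps(p) + "\nSolution: " + str(s)
                        for p, s in solved_examples)
    return shots + "\n\nPuzzle: " + json.dumps(puzzle) + "\nSolution:"

def chain_of_thought(puzzle):
    return (zero_shot(puzzle) +
            "\nThink step by step: write out the string after every "
            "replacement before committing to a final answer.")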
Make sure to document your progress as you go. Keep track of the prompts you submit and what you observe. What sorts of problems became solvable, and which didn't? This will be as important as the final results themselves. Even if you do not obtain a positive result, we are okay with that as long as there is evidence of rigour.
Note: All experiments are to be performed on “traditional” (for lack of a better word) LLMs. Do NOT use o1/DeepSeek r1/qwq/other models with “reasoning” for this. If you wish, you can add them as additional benchmarks.
Your solutions should be saved as JSON files as well (you can use the starter code for this!).
Using the previous example, the solution that we would expect would take the following form:
{
  "problem_id": "000",
  "solution": [0, 1]
}
The file should contain a solution array with the indices of the transitions in the order they are applied, from first to last (i.e. first apply HELLO -> "" and then apply WORLD -> ""). Note that even the following is a valid solution to the above puzzle:
{
  "problem_id": "000",
  "solution": [1, 0]
}
The goal is to find any list of transitions that makes the initial string empty, not to minimize the number of transitions. To help you compare with classical approaches, we have provided a baseline implementation in the starter code.
We need a way to compare different methods and models with each other. Try to come up with
some quantitative way to assign a score to each output returned by LLMs and report it. What
works and what does not work? Give examples of how your metric performs in different
scenarios.
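As one example (an assumption, not the required metric): give full credit for a verified solve and partial credit for the fraction of the string eliminated by the longest valid prefix of the proposed solution. Note where it misbehaves: growth-inducing replacements can drive the score negative, which is itself a useful failure signal.

def score(puzzle, proposed_indices):
    # Replay the proposal, stopping at the first invalid step, and grade by
    # how much of the initial string was eliminated.
    s = puzzle["initial_string"]
    start_len = len(s)
    for i in proposed_indices:
        if not (0 <= i < len(puzzle["transitions"])):
            break                              # hallucinated index: stop here
        t = puzzle["transitions"][i]
        if t["src"] not in s:
            break                              # inapplicable rule: stop here
        s = s.replace(t["src"], t["tgt"], 1)
    return 1.0 if s == "" else 1.0 - len(s) / start_len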
Can you find some examples that are easy for humans to solve, but difficult for LLMs to solve?
What about the other way around?
Have fun!