Coding_Test_problem_description_RA_final
The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of abstracts, each either
AI-generated or original. The AI-generated abstracts were produced with a state-of-the-art language
generation model, specifically GPT-3.
The dataset is provided in CSV format, with each row representing a single sample.
Each sample contains four columns: doc_id, abstract, title, and label. The label indicates whether the
sample is an original abstract (labeled as 0) or an AI-generated abstract (labeled as 1).
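As a minimal sketch of the layout described above, the following reads such a CSV with Python's standard library and splits the rows by label. Only the four column names and the 0/1 label meaning come from this description; the helper names `load_abstracts` and `split_by_label`, the in-memory sample rows, and the assumption that doc_id is an integer are illustrative.

```python
import csv
import io


def load_abstracts(stream):
    """Read rows with the columns doc_id, abstract, title, label from a text stream."""
    rows = []
    for row in csv.DictReader(stream):
        rows.append({
            "doc_id": int(row["doc_id"]),  # assumption: doc_id is an integer
            "title": row["title"],
            "abstract": row["abstract"],
            "label": int(row["label"]),  # 0 = original, 1 = AI-generated
        })
    return rows


def split_by_label(rows):
    """Return (original_rows, ai_generated_rows)."""
    original = [r for r in rows if r["label"] == 0]
    generated = [r for r in rows if r["label"] == 1]
    return original, generated


# Made-up two-row sample standing in for the real file.
SAMPLE_CSV = (
    "doc_id,abstract,title,label\n"
    '0,"We study graphs, briefly.",Graphs,0\n'
    '1,"This paper explores graphs.",Graphs,1\n'
)
sample_rows = load_abstracts(io.StringIO(SAMPLE_CSV))
```

With the real dataset, one would pass an open file handle instead of the in-memory sample, opened with `newline=""` as the `csv` module recommends.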
Coding tasks
Task 1. Find the top 10 most frequent words in the AI-generated abstracts, the original abstracts, and all abstracts combined.
Task 2. Using a natural language processing (NLP) method, identify the 10 abstracts in the whole dataset
whose content is most similar to that of the 5th abstract (doc_id = 4).
Task 3. Build a neural network/deep learning model to predict whether an abstract is AI-generated or
original. You are encouraged to use common deep learning frameworks such as PyTorch or TensorFlow.
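One possible starting point for Tasks 1 and 2, sketched with the standard library only. It is not a reference solution: the tokenizer (lowercased alphabetic runs, no stop-word removal), the smoothed TF-IDF weighting, and the use of cosine similarity are all assumptions chosen for illustration, and the function names are hypothetical. Task 3 is not covered here.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (assumption: no stop-word removal)."""
    return TOKEN_RE.findall(text.lower())


def top_words(abstracts, n=10):
    """Task 1 sketch: the n most frequent words across a list of abstracts."""
    counts = Counter()
    for text in abstracts:
        counts.update(tokenize(text))
    return counts.most_common(n)


def tfidf_vectors(abstracts):
    """Build one sparse {word: weight} TF-IDF vector per abstract."""
    tokenized = [tokenize(text) for text in abstracts]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(abstracts)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(abstracts, query_index, n=10):
    """Task 2 sketch: (index, score) pairs for the n abstracts closest to the query."""
    vectors = tfidf_vectors(abstracts)
    query = vectors[query_index]
    scores = [
        (i, cosine(query, vec))
        for i, vec in enumerate(vectors)
        if i != query_index  # skip the query abstract itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]
```

Note that `query_index` is a position in the list passed in, so the caller would need to map doc_id = 4 to the matching position before calling `most_similar`. Calling `top_words` separately on the AI-generated abstracts, the original abstracts, and the full list gives the three rankings Task 1 asks for.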
What to Submit
Please submit all the required files (with proper file names) in a single .zip file through email.
Task 1.
a. Python code
b. The top 10 words in each of those three categories (AI-generated, original, AI-generated+original)
Task 2.
a. Python code
b. Output from your code: 10 abstracts most similar to the 5th abstract
c. A requirement.txt specifying all dependent libraries and the versions used
- What has been implemented and what might have been left out due to the time limit
Task 3.
a. Python code
b. Output from your code that shows the performance of your classifier
c. A requirement.txt specifying all dependent libraries and the versions used
- The key idea and reasoning behind your algorithm/model and why you chose this type of
algorithm/model
- What has been implemented and what might have been left out due to the time limit