Coding_Test_problem_description_RA_final

The AI-GA dataset contains 14,331 abstracts, split between 7,248 AI-generated and 7,083 original samples, formatted in CSV with columns for doc_id, abstract, title, and label. The document outlines three coding tasks: identifying the top 10 most frequent words in different categories of abstracts, finding 10 abstracts similar to a specific one, and building a deep learning model to classify abstracts. Submissions should include Python code, outputs, requirement files, and Readme files detailing methods and implementations.

Description of Data

The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of abstracts, each either
AI-generated or original. The AI-generated abstracts were produced with a state-of-the-art language
generation technique, specifically the GPT-3 model.

The dataset is provided in CSV format, with each row representing a single sample.

Total sample size: 14,331 (7,248 AI-generated and 7,083 original)

Each sample contains four columns: doc_id, abstract, title, and label. The label indicates whether the
sample is an original abstract (labeled as 0) or an AI-generated abstract (labeled as 1).

Coding tasks

Task 1. Find the top 10 most frequent words in the AI-generated abstracts, in the original abstracts, and in all abstracts combined.

Task 2. Using a natural language processing (NLP) method, identify the 10 abstracts in the whole dataset
whose content is most similar to that of the 5th abstract (doc_id = 4).

Task 3. Build a neural network/deep learning model to predict whether an abstract is AI-generated or
original. You are encouraged to use common deep learning frameworks such as PyTorch or TensorFlow.

What to Submit

Please submit all required files, properly named, in a single .zip file by email.

Task 1.

a. Python code

b. The top 10 words in each of those three categories (AI-generated, original, AI-generated+original)
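
As an illustration of what the Task 1 code might look like, here is a minimal sketch using only the Python standard library. It assumes the column names given in the data description (abstract, label); the tokenization rule (lowercased alphabetic tokens) and the function names are assumptions made for this example, not requirements of the task. No stop-word filtering is applied, so the top words are likely to be common function words unless a filter is added.

```python
import csv
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters, optionally with an apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def top_words(abstracts, n=10):
    """Return the n most frequent (word, count) pairs across the abstracts."""
    counts = Counter()
    for text in abstracts:
        counts.update(tokenize(text))
    return counts.most_common(n)


def load_rows(path):
    """Read the CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def top_words_by_category(rows, n=10):
    """Compute the top-n words for label 1 (AI-generated), label 0 (original), and all rows."""
    ai_generated = [r["abstract"] for r in rows if int(r["label"]) == 1]
    original = [r["abstract"] for r in rows if int(r["label"]) == 0]
    everything = [r["abstract"] for r in rows]
    return {
        "ai_generated": top_words(ai_generated, n),
        "original": top_words(original, n),
        "all": top_words(everything, n),
    }


# Example usage (the file name is hypothetical):
#   rows = load_rows("ai_ga.csv")
#   for category, words in top_words_by_category(rows).items():
#       print(category, words)
```

Counting with collections.Counter keeps the sketch dependency-free; a pandas-based version would work equally well.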

Task 2.

a. Python code

b. Output from your code: 10 abstracts most similar to the 5th abstract

c. A requirement.txt file specifying all dependent libraries and the versions used

d. A Readme file that describes:

- The key idea and reasoning behind your method for measuring the content similarity between
abstracts and why you chose this method

- What has been implemented and what might have been left out due to the time limit
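
One possible approach to Task 2, shown here only as a sketch, is to represent each abstract as a TF-IDF vector and rank the other abstracts by cosine similarity to the query. The task does not prescribe this method; it is chosen for the example because it is simple and needs no external libraries. The function names and the smoothed IDF formula are assumptions of this sketch.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        vec = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def most_similar(doc_ids, abstracts, query_id, k=10):
    """Return the k (doc_id, score) pairs most similar to the query, excluding the query itself."""
    vectors = tfidf_vectors(abstracts)
    q = doc_ids.index(query_id)
    scored = [
        (cosine(vectors[q], vectors[i]), doc_ids[i])
        for i in range(len(doc_ids))
        if i != q
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Example usage, assuming rows were read with csv.DictReader:
#   doc_ids = [int(r["doc_id"]) for r in rows]
#   abstracts = [r["abstract"] for r in rows]
#   print(most_similar(doc_ids, abstracts, query_id=4, k=10))
```

TF-IDF measures lexical overlap only; an embedding-based method could capture paraphrased content at the cost of extra dependencies, which is the kind of trade-off the Readme is meant to explain.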

Task 3.

a. Python code

b. Output from your code that shows the performance of your classifier

c. A requirement.txt file specifying all dependent libraries and the versions used

d. A Readme file that describes:

- The key idea and reasoning behind your algorithm/model and why you chose this type of
algorithm/model

- What has been implemented and what might have been left out due to the time limit

- Key performance metrics used in training and testing
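
For Task 3, a small feed-forward network over bag-of-words features is one possible baseline. The sketch below uses PyTorch and is an assumption-laden illustration, not a recommended architecture: the vocabulary size, hidden width, dropout rate, and all function and class names are choices made for this example. A real submission would also need a held-out test split and the performance metrics listed above.

```python
import re
from collections import Counter

import torch
from torch import nn

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocab(texts, max_size=5000):
    """Map the max_size most frequent tokens to column indices."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def vectorize(texts, vocab):
    """Return log-scaled bag-of-words counts as a dense float tensor.

    Note: the tensor is dense, so memory grows with len(texts) * len(vocab).
    """
    x = torch.zeros(len(texts), len(vocab))
    for row, text in enumerate(texts):
        for tok in tokenize(text):
            col = vocab.get(tok)
            if col is not None:
                x[row, col] += 1.0
    return torch.log1p(x)


class AbstractClassifier(nn.Module):
    """One-hidden-layer MLP that outputs a single logit per abstract."""

    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def train(model, x, y, epochs=20, lr=1e-3, batch_size=64):
    """Train with Adam and binary cross-entropy; return the mean loss per epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    history = []
    for _ in range(epochs):
        model.train()
        perm = torch.randperm(len(x))  # reshuffle every epoch
        total = 0.0
        for start in range(0, len(x), batch_size):
            idx = perm[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(model(x[idx]), y[idx])
            loss.backward()
            optimizer.step()
            total += loss.item() * len(idx)
        history.append(total / len(x))
    return history


def accuracy(model, x, y):
    """Fraction of samples whose predicted label (threshold 0.5) matches y."""
    model.eval()
    with torch.no_grad():
        preds = (torch.sigmoid(model(x)) >= 0.5).float()
    return (preds == y).float().mean().item()
```

In use, the vocabulary would be built from the training abstracts only, the labels passed as a float tensor (1.0 for AI-generated, 0.0 for original), and accuracy reported on abstracts the model did not see during training.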