INTRO TO MACHINE LEARNING PROJECTS: IDENTIFYING DUPLICATE QUESTIONS

Identifying Duplicate Questions

Background and summary: Quora published this dataset to support work on identifying duplicate questions, which makes it easier to find the answers to a given question. As a simple example, the queries "What is the most populous state in the USA?" and "Which state in the United States has the most people?" should not exist separately on Quora because the intent behind both is identical. Having a canonical page for each logically distinct query makes knowledge-sharing more efficient, because knowledge seekers can find all the answers to a question in a single location.

Goal: Given a sentence pair, determine whether the sentences are semantically equivalent, that is, whether they are duplicates.
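
To make the goal concrete, here is a minimal baseline sketch. It is not a prescribed method for this project: it simply scores a pair by word overlap (Jaccard similarity) and flags it as a duplicate above a cutoff. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(q1, q2):
    """Return |A & B| / |A | B| for the token sets of the two questions."""
    a, b = tokenize(q1), tokenize(q2)
    if not a and not b:
        # Two empty questions are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def predict_duplicate(q1, q2, threshold=0.5):
    """Hypothetical decision rule: duplicate if overlap meets the threshold."""
    return jaccard(q1, q2) >= threshold


q1 = "What is the most populous state in the USA?"
q2 = "Which state in the United States has the most people?"
score = jaccard(q1, q2)  # 4 shared tokens out of 13 distinct tokens
print(round(score, 3), predict_duplicate(q1, q2))
```

On the example pair from the background section, the two questions share only 4 of 13 distinct words, so this baseline scores them at roughly 0.31 and misses the duplicate. That gap between lexical overlap and semantic equivalence is what a stronger model would need to close.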

Input data: Over 400,000 lines of sentence pairs, each with the following fields:
1. qid1, qid2: IDs of question 1 and question 2
2. question1, question2: Text of each question
3. is_duplicate: Binary true/false label indicating whether the line is a duplicate pair
Data can be found here: Duplicate Questions
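
As a rough sketch of how the fields above might be read into memory, the snippet below parses a small in-memory sample with Python's standard csv module. The comma delimiter, the 0/1 encoding of is_duplicate, the QuestionPair and load_pairs names, and the sample rows are all assumptions made for illustration; check them against the actual file before relying on this.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QuestionPair:
    qid1: int
    qid2: int
    question1: str
    question2: str
    is_duplicate: bool


def load_pairs(handle, delimiter=","):
    """Read question pairs from an open text handle that has a header row."""
    pairs = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        pairs.append(
            QuestionPair(
                qid1=int(row["qid1"]),
                qid2=int(row["qid2"]),
                question1=row["question1"],
                question2=row["question2"],
                # Assumption: the label is stored as 0/1 or true/false text.
                is_duplicate=row["is_duplicate"].strip().lower() in {"1", "true"},
            )
        )
    return pairs


# Hypothetical sample rows, not taken from the real dataset.
SAMPLE = """qid1,qid2,question1,question2,is_duplicate
1,2,What is the most populous state in the USA?,Which state in the United States has the most people?,1
3,4,How do I learn Python?,What is the capital of France?,0
"""

pairs = load_pairs(io.StringIO(SAMPLE))
duplicate_rate = sum(p.is_duplicate for p in pairs) / len(pairs)
print(len(pairs), duplicate_rate)
```

For the real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8"). Checking the share of duplicate labels early is useful, because an imbalanced label distribution affects how accuracy should be interpreted.
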
Relevant Research:
Quora
Paraphrase Detection
Semantic Similarity
Textual Entailment