Duplicate Question Detection: Using Random Forest Algorithm
Research Presentation
Team Members:
1. Arjun Shrestha
2. Sanjeev Roka
3. Sushant Khakurel
4. Vijay Dhakal
Under the supervision of
Surya Bam
CONTENTS
1 Introduction
-Problem Definition -Objective -Limitations
2 Methodology
-Data Collection -Algorithm Used
3 Implementation
-Architectural Design -Use Case Diagram -Sequence Diagram
4 Demonstration
5 Conclusion
INTRODUCTION
Problem Definition
• With duplicate Questions:
Objective
• Allow the user to ask a question.
Limitations
• Semantics are difficult to capture
• Ambiguity in natural language
METHODOLOGY
Data collection
• Collected from Kaggle; the dataset was released by Quora.
Random Forest Algorithm
• Supervised machine learning algorithm
• Ensemble of multiple decision trees
• Each tree is a CART (Classification and Regression Tree)
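As a rough illustration of the ensemble idea, the sketch below combines the votes of several trees by majority. The three "trees" are hypothetical hand-written rules rather than trained models, and the feature names are made up for the example.

```python
from collections import Counter

def majority_vote(trees, features):
    """Return the label predicted by most trees (ties go to the label seen first)."""
    votes = [tree(features) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees; each maps features to 0 or 1.
def tree_1(f):
    return 1 if f["common_words"] >= 3 else 0

def tree_2(f):
    return 1 if f["length_diff"] <= 5 else 0

def tree_3(f):
    return 1 if f["common_words"] >= 5 else 0

# Two of the three trees vote 1 (duplicate), so the ensemble predicts 1.
prediction = majority_vote([tree_1, tree_2, tree_3],
                           {"common_words": 4, "length_diff": 2})
```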
How does random forest work?
5. Build the forest by repeating steps 1 to 4 n times to create n trees.
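Step 5 can be sketched as a simple loop. Here `build_tree` is a hypothetical placeholder for steps 1 to 4: it draws a bootstrap sample and builds a one-split stump, which is only an assumption about what the per-tree procedure looks like. The toy data values are made up.

```python
import random

def build_tree(rows, rng):
    """Hypothetical stand-in for steps 1 to 4.

    rows is a list of (feature_value, label) pairs. The stump built here
    splits at the mean feature value of a bootstrap sample.
    """
    sample = [rng.choice(rows) for _ in rows]            # sample with replacement
    threshold = sum(x for x, _ in sample) / len(sample)  # split at the sample mean
    return lambda x: 1 if x >= threshold else 0

def build_forest(rows, n, seed=0):
    """Repeat the per-tree procedure n times to create n trees."""
    rng = random.Random(seed)
    return [build_tree(rows, rng) for _ in range(n)]

# Toy data: (shared-word ratio, is_duplicate); values are made up.
rows = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
forest = build_forest(rows, n=5)
```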
Split the dataset
Selection of features
• Features for each tree are selected at random
• We used,
-
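Random feature selection per tree might look like the following minimal sketch. The feature names and the subset size of 2 are hypothetical and chosen only for illustration.

```python
import random

def pick_features(all_features, k, rng):
    """Pick k distinct features at random for one tree."""
    return rng.sample(all_features, k)

# Hypothetical feature names, for illustration only.
all_features = ["len_q1", "len_q2", "common_words", "length_diff"]
rng = random.Random(42)

# One random subset of two features for each of three trees.
per_tree = [pick_features(all_features, 2, rng) for _ in range(3)]
```

Many library implementations draw the random subset at every split rather than once per tree; the sketch follows the per-tree wording used here.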
Finding Best Split
• For each selected feature, calculate the Gini index.
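The Gini index of a set of labels is 1 minus the sum of the squared class proportions, and a split is scored by the size-weighted Gini of its two sides (lower is better). The sketch below assumes a single numeric feature and binary labels; the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try each distinct value as a threshold; return (threshold, gini) with the lowest Gini."""
    best = None
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        if not left or not right:
            continue  # skip splits that leave one side empty
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Toy example: the threshold 3 separates the two classes perfectly.
result = best_threshold([1, 2, 3, 4], [0, 0, 1, 1])
```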
Architectural Design
[Figure: architecture diagram. Components: User, User Interface, Pre-Processor, Random Forest Model, Questions Collection. Labeled flows: Inputs Question, Pre-process, Fetch, Predict, Result.]
Data Preprocessing
• Lowercasing
• Noise removal
• Tokenization
• Stop-word removal
• Lemmatization
• Conversion into vectors
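The preprocessing steps above can be sketched with the standard library alone. The stop-word list is a tiny illustrative one, `naive_lemma` is a crude placeholder for a real lemmatizer, and the bag-of-words vector is one assumed way to convert tokens into vectors.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "what", "how", "do", "i"}

def naive_lemma(token):
    """Crude placeholder for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(question):
    text = question.lower()                              # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # noise removal (punctuation, symbols)
    tokens = text.split()                                # tokenization on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_lemma(t) for t in tokens]              # lemmatization (placeholder)

def to_vector(tokens, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens = preprocess("How do I learn Python programming languages?")
vector = to_vector(tokens, ["python", "java", "language"])
```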
Feature Extraction
• Simple features
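As an illustration of what simple features for a question pair could look like, the sketch below computes lengths and word overlap. These particular features are assumptions for the example, not necessarily the set used in this project.

```python
def simple_features(q1, q2):
    """Hypothetical simple features for a question pair."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    common = w1 & w2
    total = w1 | w2
    return {
        "len_q1": len(q1),                        # character length of question 1
        "len_q2": len(q2),                        # character length of question 2
        "length_diff": abs(len(q1) - len(q2)),    # absolute length difference
        "common_words": len(common),              # number of shared distinct words
        "word_share": len(common) / len(total) if total else 0.0,
    }

features = simple_features("how to learn python", "how to learn java")
```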
Use Case Diagram
Sequence Diagram:
Tools Used
Demonstration
Conclusion
Any Questions?
[Figure: random forest diagram with three trees (Tree 1, Tree 2, Tree 3), each receiving an Input]