
Duplicate Question Detection: Using Random Forest Algorithm

This document summarizes a project that used the random forest algorithm to detect duplicate questions. Key points: 1. The team collected question data from Kaggle to train a random forest model that predicts whether a new user question is a duplicate. 2. The random forest algorithm was chosen because it is an ensemble of decision trees that handles both categorical and numerical data well. 3. The implementation includes data preprocessing such as lowercasing and stop-word removal, feature extraction using fuzzy string matching and distances, and an architectural design covering input, processing, and prediction output.

Uploaded by

vashkar parajuli
Copyright
© All Rights Reserved
Duplicate Question Detection

Using Random Forest Algorithm

Research Presentation

Team Members:
1. Arjun Shrestha
2. Sanjeev Roka
3. Sushant Khakurel
4. Vijay Dhakal
Under the supervision of
Surya Bam
CONTENTS
1 Introduction
-Problem Definition -Objective -Limitations

2 Methodology
-Data Collection -Algorithm Used

3 Implementation
-Architectural Design -Use Case Diagram -Sequence Diagram

4 Demonstration

5 Conclusion
INTRODUCTION

Problem Definition
• With duplicate questions:
• The database carries extra load.
• Answerers have to give the same answers repeatedly.

Objective
• Allow the user to ask a question.
• Predict whether a similar question has been asked previously.

Limitations
• Difficult to capture the semantics of a question
• Ambiguity in natural language

METHODOLOGY

Data collection
• Collected from Kaggle; the dataset was released by Quora.
• Used only a fraction of the data, i.e. 8000.

Random Forest Algorithm
• Supervised machine learning algorithm
• Ensemble of multiple decision trees
• Each tree is built with the CART algorithm

How does random forest work?

1. Randomly select “k” features from the total “m” features, where k << m.

2. Among the “k” features, calculate the node “d” using the best split point.

3. Split the node into child nodes using the best split.

4. Repeat steps 1 to 3 until “l” nodes have been reached.

5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
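
The steps above can be sketched in plain Python. This is a minimal illustration and not the team's implementation: it assumes numeric features and non-tuple class labels, uses the Gini index as the split criterion, stops on a depth limit (`max_depth`) instead of counting “l” nodes, and adds the bootstrap sampling that random forests normally use even though the steps do not list it. All function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Step 2: among the candidate features, find the (feature, threshold)
    pair with the lowest weighted Gini impurity."""
    best = None  # (score, feature, threshold)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, k, max_depth, rng):
    """Steps 1 to 4: grow one tree, drawing k random features at every node.
    A leaf is a class label; an internal node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    feature_ids = rng.sample(range(len(rows[0])), k)   # step 1
    split = best_split(rows, labels, feature_ids)       # step 2
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,                                        # step 3
            build_tree([r for r, _ in left], [y for _, y in left], k, max_depth - 1, rng),
            build_tree([r for r, _ in right], [y for _, y in right], k, max_depth - 1, rng))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(rows, labels, n_trees, k, max_depth=3, seed=0):
    """Step 5: repeat the tree construction n_trees times,
    each time on a bootstrap sample of the training rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], k, max_depth, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]
```

Calling `build_forest(rows, labels, n_trees=25, k=2)` on rows with three features corresponds to n = 25, k = 2, and m = 3 in the notation of the steps.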

Split the dataset

Selection of features
• Features for each tree are selected at random

• We used,
-

Finding Best Split
• For each selected feature, calculate the Gini index.
• Select the feature with the minimum Gini index.
• Split the node on that feature.
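
For reference, the Gini index is commonly defined as shown here. The symbols and the numbers in the worked example are illustrative assumptions and do not come from the slides.

```latex
% Gini impurity of a node whose samples fall into classes c with proportions p_c
G(\text{node}) = 1 - \sum_{c} p_c^{2}

% Gini index of a split into children L and R holding n_L and n_R of the n samples
G_{\text{split}} = \frac{n_L}{n}\, G(L) + \frac{n_R}{n}\, G(R)

% Worked example: 10 question pairs.
% L holds 3 duplicates and 1 non-duplicate; R holds 1 duplicate and 5 non-duplicates.
G(L) = 1 - \left( 0.75^{2} + 0.25^{2} \right) = 0.375
G(R) = 1 - \left( (1/6)^{2} + (5/6)^{2} \right) = 10/36 \approx 0.278
G_{\text{split}} = 0.4 \times 0.375 + 0.6 \times 0.278 \approx 0.317
```

A candidate feature whose best split gives a lower value than 0.317 would be preferred over this one.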

Architectural Design
[Diagram. Components: User, User Interface, Pre-Processor, Feature Extractor, Random Forest Model, Questions Collection, Duplicate Question Detection. Arrow labels: Inputs Question, Pre-process, Get Features, Fetch, Predict, Result.]
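
One possible reading of this design as code is sketched here. The function `find_duplicates` and its parameters are hypothetical; the pre-processor, feature extractor, and model are passed in as callables because the slides do not specify their interfaces.

```python
from typing import Any, Callable, List, Sequence

def find_duplicates(
    new_question: str,
    stored_questions: Sequence[str],      # fetched from the questions collection
    preprocess: Callable[[str], Any],     # pre-processor
    extract_features: Callable[[Any, Any], Any],  # feature extractor
    predict: Callable[[Any], int],        # trained model: 1 = duplicate, 0 = not
) -> List[str]:
    """Return the stored questions that the model flags as duplicates."""
    cleaned_new = preprocess(new_question)
    matches = []
    for stored in stored_questions:
        features = extract_features(cleaned_new, preprocess(stored))
        if predict(features) == 1:
            matches.append(stored)
    return matches
```

Passing the stages in as arguments keeps the sketch independent of any particular library; a real system would plug in the trained random forest's prediction function as `predict`.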

Data Preprocessing
• Lowercasing
• Noise removal
• Tokenization
• Stop-word removal
• Lemmatization
• Conversion into vectors
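
A minimal sketch of these preprocessing steps, using only the Python standard library. The stop-word list and the lemma table are tiny hypothetical stand-ins; a real project would more likely rely on a library such as NLTK or spaCy, and the slides do not say which one the team used. The bag-of-words vector is likewise an assumed choice of vectorization.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "do", "i", "what", "how"}
LEMMAS = {"questions": "question", "asked": "ask", "learning": "learn"}

def preprocess(text):
    """Turn a raw question into a list of cleaned tokens."""
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # noise removal
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization by table lookup

def to_vector(tokens, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary."""
    return [tokens.count(word) for word in vocabulary]
```

For example, `to_vector(preprocess(question), vocabulary)` gives one count per vocabulary word, so two questions can be compared as numeric vectors.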

Feature Extraction
• Simple features
• FuzzyWuzzy features (based on edit distances)
• Distance-based features
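
The three feature groups could look roughly like the sketch here. The specific features are assumptions, since the slides do not list them: length difference and common-word count stand in for the simple features, a normalized Levenshtein similarity stands in for the FuzzyWuzzy scores (it is not the FuzzyWuzzy library's own scoring), and Jaccard distance stands in for the distance-based features.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def extract_features(q1, q2):
    """Build a small feature dictionary for a pair of questions."""
    s1, s2 = set(q1.lower().split()), set(q2.lower().split())
    common = len(s1 & s2)
    union = len(s1 | s2)
    longest = max(len(q1), len(q2)) or 1  # avoid division by zero
    return {
        # simple features
        "len_diff": abs(len(q1) - len(q2)),
        "common_words": common,
        # edit-distance feature, scaled to [0, 1] where 1 means identical
        "edit_similarity": 1 - levenshtein(q1.lower(), q2.lower()) / longest,
        # distance-based feature on the word sets
        "jaccard_distance": (1 - common / union) if union else 0.0,
    }
```

The resulting values can be placed in a fixed order to form the numeric row that the random forest model receives for each question pair.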

Use Case Diagram

Sequence Diagram:

Tools Used

Demonstration

Conclusion
Any Questions?
[Figure: inputs feeding Tree 1, Tree 2, and Tree 3]
