
SENTIMENT ANALYSIS USING NLP

(INTERNSHIP REPORT)

Enrol. No. - 181322

Name of Student - Devansh Kaushik

Project Supervisor - Dr. Amol Vasudeva

May - 2022

Submitted in partial fulfilment of the Degree of


Bachelor of Technology
In
Computer Science Engineering

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING &


INFORMATION TECHNOLOGY
JAYPEE UNIVERSITY OF INFORMATION TECHNOLOGY, SOLAN
DECLARATION

I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by another
person, nor material which has been accepted for the award of any other degree or
diploma of this university or any other institute of higher learning, except where due
acknowledgement has been made in the text.

Date: 30 May 2022          Name: Devansh Kaushik          Enroll. No: 181322

(i)
CERTIFICATE

This is to certify that Devansh Kaushik has been working as an intern with Paxcom since 7th
February 2022 on developing different product-related features and integrating various
tools. This letter was issued at the request of the employee, and the company bears no
responsibility or liability on behalf of the employee for any transaction that may arise.

Name of Supervisor: Sakshi Sehgal


Designation: Senior Data Scientist
Date: 27th May

(ii)
ACKNOWLEDGEMENT

I have put a great deal of effort into this internship. However, it would not have been
possible without the kind support and help of many peers and the organisation. I would
like to extend my sincere thanks to all of them. I am highly indebted to Ms. Sakshi Sehgal
for her guidance and constant supervision, for providing information regarding the
internship, and for her support in completing the projects assigned during the internship.
My thanks and appreciation also go to my colleagues who helped develop the project and
to everyone who willingly helped out with their abilities.

(iii)
TABLE OF CONTENTS
Chapter No. Topics Page No.

Chapter – 1 Introduction 1-3

1. General Introduction
2. Problem Statement
3. Solution Approach

Chapter – 2 Libraries and Software Used 4-7

1. Python
2. Jupyter
3. Google Colab
4. LabelImg
5. Doccano
6. Pandas
7. NumPy
8. NLTK
9. Scikit-learn
10. Gensim
11. Transformers

Chapter – 3 Task Assigned Details 8-31

1. Dataset Description
2. Dataset Labelling
3. Pre-processing & Cleaning
4. Vectorizers
5. Word Embeddings
6. Word2Vec & GLoVe
7. Bert & ELMo
8. ML Algorithms
9. Boosting Algorithms

Chapter – 4 Conclusion 32

1. Conclusion
2. Future Work

Chapter – 5 References 33

(iv)
LIST OF ABBREVIATIONS

1. ML – Machine Learning
2. NLP – Natural Language Processing
3. CPU – Central Processing Unit
4. GPU – Graphics Processing Unit
5. BERT - Bidirectional Encoder Representations from Transformers
6. NLTK – Natural Language Toolkit
7. GloVe – Global Vectors for Word Representation
8. RAM – Random Access Memory
9. TF – Term Frequency
10. IDF – Inverse Document Frequency
11. Approx. – Approximately
12. RoBERTa – Robustly Optimized BERT Pretraining Approach
13. POS – Parts of Speech
14. ELMo – Embeddings from Language Model
15. SVM – Support Vector Machines
16. AVL – Adelson-Velsky and Landis

(v)
LIST OF FIGURES

1. NLP Pipeline Broad Overview

2. Using python to code function which gets performance results and best model

3. LabelImg being used to label barcodes and QR codes

4. Labelling texts using Doccano

5. 2 Datasets (i) Biscuit Feedback (ii) Electronics Feedback

6. Labelling Function

7. Pre-processing Flow Diagram

8. Pre-processing Pipeline Code – 1

9. Stemming vs Lemmatization

10. Pre-processing Pipeline Code – 2

11. Pos Tagging

12. One Hot Encoding

13. Count Vectorizer

14. TF-IDF Formula

15. TF-IDF Implementation

16. TF-IDF Results

17. Word Embeddings

18. CBOW Model

19. Skip-gram Model

20. Word2Vec Results

21. GloVe Embeddings

22. GloVe Results

23. ELMo Embeddings

(vi)
24. ELMo Results

25. BERT Embeddings

26. BERT Results

27. Learning Curve

28. SVC

29. Pipeline Overview

30. Decision Tree

31. Random Forest

32. Boosting Algorithms

33. XG Boosting

34. Optimised Results

(vii)
ABSTRACT

Paxcom India Pvt Ltd. is an e-commerce and product-based company with the vision of becoming
a leader in e-commerce analytics and automation across all key geographies, with PaymentUs,
an integrated payment platform, as its parent company. Among the many teams at Paxcom, such
as IVR, e-commerce and ChatBot, I work in the ML team, which designs and takes on projects
from clients as well as from other Paxcom teams, helping increase efficiency and save the
man-hours that a good model makes unnecessary. The ML team works together on multiple models,
including ensemble learning approaches that either extract features from the dataset or treat
a variety of algorithms as features, thereby improving the model's performance metrics (F1,
accuracy or precision, depending on project requirements). In the ML team, with the guidance
and support of my team lead and colleagues, I have gained industrial exposure to NLP and have
been working hands-on with various datasets. So far, a generalized sentiment analysis pipeline
has been built which, given a raw unlabelled dataset as input, will label it, pre-process it,
run it across multiple algorithms, and finally return the best hyperparameter-tuned model
along with results and a classification report. I have worked on annotation/labelling, blur
classification, sentiment analysis, and both supervised and unsupervised learning. With the
help of my team, I have strengthened my skill set in ML and NLP both theoretically and
hands-on, and I am currently continuing with deep neural networks. This report describes how
I approached the sentiment analysis problem, what I learnt along the way, and the
corresponding performance and results.

(viii)
Chapter 1. INTRODUCTION

1. General Introduction

Sentiment Analysis, as the name suggests, means identifying the view or emotion behind
a situation. It is the task of analyzing and finding the emotion or intent behind a piece of text,
speech or any other mode of communication, with use cases ranging from feedback forms to
automated human emotion detection.

Polarity precision can be crucial for many businesses: for instance, by analyzing customer
feedback in survey responses and conversations, one can learn to detect emotions and tailor
services and products accordingly. One way to approach such problems is NLP (Natural Language
Processing), a branch of computer science concerned with giving computers the ability to
understand the vocabulary and grammar of text.

Since machines only understand zeroes and ones, we need to convert such grammatical
phrases into numerical data while retaining their contextual information, and then apply
machine learning algorithms or deep learning networks. The result is a model that can
understand sentiment much like a human and can then be used in end-user applications. Thus,
in this internship, I tried to explore NLP in more depth and prepare myself for industry-level
projects through what I learnt from this use case.

The pipeline is fully generalized: it labels the dataset with state-of-the-art models, applies a
series of preprocessing steps, vectorizes the corpus or creates embeddings, runs a grid search
over multiple ML and boosting algorithms with cross-validation, evaluates the performance
results and classification report, and returns the best (non-overfitted) model.
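A minimal sketch of the grid-search step with scikit-learn is shown below; the toy data and the parameter grid are assumptions for illustration only, not the exact pipeline used at Paxcom.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy labelled samples standing in for the real feedback dataset.
texts = ["loved these biscuits", "very good product",
         "worst electronics ever", "bad quality item",
         "it is okay", "average experience"]
labels = [1, 1, -1, -1, 0, 0]

# Vectorize, then classify; GridSearchCV cross-validates every combination.
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)],
              "clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))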

(1)
Fig 1. NLP Pipeline Broad Overview

The project includes an analysis of the different preprocessing steps, vectorizers and
algorithms used, and of why specific algorithms and vectorizers/embeddings worked better
than others.

2. Problem Statement

The problem is how to parse through thousands of lines of text corpus or feedback survey
records and detect the human emotion and sentiment behind them, thereby using time and
manpower efficiently. Emotion detection has proven remarkably significant for businesses,
helping companies plan and tailor products to the needs of the end user.

(2)
3. Solution Approach

The solution is to create a generalized ML pipeline which, when given an unlabelled or labelled
dataset, will preprocess the data, vectorize it, run ML models on it, and at the end of the
pipeline return the best hyperparameter-tuned model, entirely on its own with no human
intervention.

(3)
Chapter 2. LIBRARIES AND SOFTWARE USED

1. Python

Python is a high-level interpreted language, used widely for ML and API development. Unlike C
and Java, Python condenses long programs into a few lines thanks to its rich libraries and
functions and its use of whitespace indentation.

Fig 2. Using python to code function which gets performance results and best model

All pipelines and pre-processing have been done in Python, as its tools and integrated ecosystem
make it easy to analyze models and visualize performance, which saves time.

2. Jupyter

Jupyter Notebook is a web-based, locally hosted platform offering flexible coding environments
for code and data. It supports both CPU and GPU, so one can use one's own machine's processor
and RAM to run models.

(4)
3. Google Colab

While Jupyter notebooks are convenient for using local RAM and storing data locally, Colab is
also a web-based computing platform, but one can use Google's GPUs and RAM for inference.
Notebooks and datasets can be imported from or saved to Google Drive, and code and data can
be shared easily.

4. LabelImg

LabelImg is an open-source graphical image annotation tool in which you can mark the
meaningful regions of images that you want to label for classification.

Fig 3. LabelImg being used to label barcodes and QR codes

5. Doccano

Doccano is an open-source text annotation tool, analogous to LabelImg, used for highlighting
spans of text that are later encoded as numerical labels for NLP models.

(5)
Fig 4. Labelling texts using Doccano

6. Pandas

Pandas is a Python library used to structure and manipulate data through structures such as the
DataFrame and Series. In this project, pandas comes in handy for tabulating data and applying
operations.

7. Numpy

NumPy is a Python library used for applying mathematical operations and transformations to
n-dimensional data structures.

8. NLTK

NLTK (Natural Language Toolkit) is a text pre-processing library used to structure and organize
raw textual data for text analysis and visualization. Functions such as the tokenizers, the
lemmatizer and the stopwords corpus are some of the NLTK features that come in handy when
pre-processing text.

9. Scikit-Learn

Scikit-Learn is a machine learning library for Python providing powerful preprocessing tools,
various algorithms for regression and classification, feature selection and extraction, and
dimensionality reduction functions.

10. Gensim

Gensim is an open-source library used for document indexing and for finding phrase/document
similarity. It also provides Word2Vec, which, as the name suggests, converts words and phrases
into n-dimensional vectors, attempting to capture the context of phrases in a sentence or
document. These embeddings came as an improvement over count and TF-IDF vectorizers.
(6)
11. Transformers

Transformers is an open-source library by HuggingFace widely known for its state-of-the-art
NLP models for PyTorch, TensorFlow and JAX. The major use of Transformers in our pipeline is
BERT, a state-of-the-art model which helps us label unlabelled data and provides word
embeddings for the corpus.

These were the libraries used most heavily; other notable mentions include collections,
xgboost, catboost, tensorflow and more.

(7)
Chapter 3. TASK ASSIGNED DETAILS

1. Dataset Description

Two unlabelled datasets are merged together for this task: the first contains feedback entries
from people who bought biscuits online, and the second contains feedback entries from people
who bought electronics online.

Fig 5. 2 Datasets (i) Biscuit Feedback (ii) Electronics Feedback

After removing duplicates, there are 5000 text samples in total, with 3 labels to classify:
Positive (1), Negative (-1) and Neutral (0). The entries are mostly in English, with some
outliers in Hindi and Tamil.
(8)
The problem with this dataset is that it is highly imbalanced, with approximately 3500 entries
labelled positive, 1400 negative and only about 100 neutral.

Hence, instead of accuracy, this situation calls for focusing on F1 as the performance metric:
accuracy can stay high even when the model predicts the positive entries perfectly but gets the
neutral entries badly wrong, as sketched below.
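A minimal sketch of producing such class-aware metrics with scikit-learn follows; the label arrays here are placeholders, not the real dataset.

from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted labels: 1 = positive, -1 = negative, 0 = neutral.
y_true = [1, 1, 1, -1, -1, 0, 1, -1, 0, 1]
y_pred = [1, 1, 1, -1, 1, 1, 1, -1, 0, 1]

# Macro F1 averages the F1 of each class equally, so the tiny neutral class
# cannot be hidden behind a high overall accuracy.
print(f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))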

2. Dataset Labelling

As the dataset is unlabelled, our first step is to label the data with a very strong model.
This is where RoBERTa (A Robustly Optimized BERT Pretraining Approach) steps in, which is
based on Google's BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
objective and training with much larger mini-batches and learning rates.
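A minimal sketch of such a labelling step with the HuggingFace Transformers pipeline is shown below. The checkpoint name and the label mapping are assumptions for illustration; the actual labelling function used in the internship is the one shown in Fig 6.

from transformers import pipeline

# Assumed checkpoint: a RoBERTa model fine-tuned for three-class sentiment; any
# comparable checkpoint could be substituted here.
classifier = pipeline("sentiment-analysis",
                      model="cardiffnlp/twitter-roberta-base-sentiment")

# Hypothetical mapping from the model's output labels to the dataset's -1/0/1 scheme.
label_map = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}

def label_review(text):
    # Truncate long feedback so it fits within the model's input limit.
    result = classifier(text[:512])[0]
    return label_map.get(result["label"], 0)

print(label_review("The biscuits were stale and the packaging was torn."))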

Fig 6. Labelling function

(9)
This pre-trained model itself takes care of stemming, lemmatization and the removal of unknown
symbols and stopwords. One might think of using this model straight away for all kinds of
problems since, honestly, it is the best. The point to note, however, is that it is still a very
large pretrained model, hosted on HuggingFace's cloud and only remotely available to us, so
using RoBERTa takes a considerable amount of time; this motivates building our own model
tailored to our requirements.

Finally, the dataset is labelled and now requires a manual review, as the model, however good,
might not match our labelling requirements exactly.

3. Preprocessing & Cleaning

Preprocessing a textual dataset involves a series of steps that prepare it for vectorizing, in
other words, for conversion into numerical data.

Fig 7. Pre-processing Flow Diagram

1. Removal of unwanted symbols: This can be done by restricting the allowed ASCII characters
to lowercase letters, uppercase letters and digits.

(10)
2. Tokenizer: A tokenizer creates a list of the words in a sentence or paragraph, disregarding
punctuation and special characters. NLTK provides 2 kinds of tokenizers, sent_tokenize and
word_tokenize: sent_tokenize splits a document or paragraph into sentences, while
word_tokenize splits a sentence into tokens.

Fig 8. Preprocessing Pipeline Code – 1

3. Expanding shorthands: Expanding contractions such as don’t, can’t and it’s is essential, as
it helps in recognizing them as stopwords and yields “not” as a separate token for negative
entries.
4. Removing stopwords: Now that the words have been expanded, NLTK’s stopwords corpus can be
used to recognize and remove them.
5. Spellchecker: A spellchecker such as TextBlob is used here to correct wrongly spelled words.
6. Standardization: People often use slang picked up from social networking sites in their
written vocabulary; writing “good” as “gud” or “osm” for “awesome” are some of the common
examples.
7. Lemmatization: Lemmatization brings a token back to its dictionary form, for example
“studies” to “study”. While stemming is also widely used, it has not been

(11)
applied here because stemming does not return an actual word; it merely strips affixes, turning
“studies” into “studi”, which is not a base word. Lemmatization gives us the base word.

Fig 9. Stemming vs Lemmatization

Fig 10. Preprocessing Pipeline Code - 2

(12)
8. POS Tagging: Finally, POS tagging recognizes and filters words based on their part of speech,
for example removing proper nouns and keeping adjectives.

Fig 11. Pos Tagging

Thus, preprocessing is done; empty and duplicate entries can now be removed from the dataset
before moving on to the next step, i.e., vectorizing. A minimal sketch of the core preprocessing
steps is shown below.
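This sketch covers only cleaning, tokenization, stopword removal and lemmatization with NLTK; contraction expansion, spell checking, slang standardization and POS filtering are omitted for brevity, and the exact implementation used in the internship is the one shown in Figs 8 and 10.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required resources (names can vary by NLTK version).
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Keep only letters, digits and whitespace, and lowercase everything.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text).lower()
    # Tokenize, drop stopwords, and lemmatize each remaining token.
    tokens = [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)
              if tok not in stop_words]
    return " ".join(tokens)

print(preprocess("These biscuits were not as tasty as the studies claimed!"))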

(13)
4. Vectorizers

Two vectorizers are mainly used, the Count Vectorizer and the TF-IDF Vectorizer; one-hot
encoding is also described here for comparison.

1. One Hot Encoding: Each word gets a one at its own index while all other positions are zero.
The easiest implementation by far, but highly inefficient in terms of matrix dimensionality and
memory usage.

Fig 12. One Hot Encoding

2. Count Vectorizer: A simple vectorizer that assigns an index to every unique word; words are
then given higher or lower weight by the model according to their counts.

Again, a simple approach, but with serious limitations: it is incapable of identifying
relationships between words and works purely on the number of occurrences in the whole
document.

(14)
Fig 13. Count Vectorizer

3. TF-IDF Vectorizer: Term Frequency – Inverse Document Frequency is based not only on the
count of a word within a document, but also takes into account the number of documents in
which the word occurs.

Fig 14. TF – IDF Formula
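In the usual notation (a sketch of the standard weighting the figure above depicts, not a reproduction of it), the score of a term t in a document d over a corpus of N documents is

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}

where tf(t, d) is how often t occurs in d and df(t) is the number of documents containing t. scikit-learn's implementation additionally smooths the IDF term so the weight never drops exactly to zero.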

This is a simple yet powerful tool and in some cases works even better than Word2Vec, which
also takes contextual information into account. But it has 2 edge cases:

ZERO VALUE ISSUE: This occurs because of the IDF calculation. Imagine we have three
documents, and the word we are considering is "cat". In document A, cat makes up 70% of

(15)
the words. If cat appears once in document B and not at all in document C, the TF-IDF value
would be high. But if cat appeared once in C too, the TF-IDF value would fall all the way to
zero, which is far too big a contrast for a change of just one occurrence.

This is undesirable: even though cat is a very important word here, TF-IDF says otherwise.

EXTENSIVE MARGIN ISSUE: Again caused by the IDF portion. TF-IDF does account for the number of
times the word cat occurs within a document, but it only checks whether cat appears in the
other documents, not how many times it occurs there.

Consider the same scenario as above, but now suppose cat makes up only 5% of the words in
document A. If cat did not occur at all in document B, the TF-IDF value would be high, but if it
appeared once in B, TF-IDF would again drop to zero.

Hence, even though the situation is almost identical, TF-IDF shows a complete change of
behaviour from a one-word difference. For this reason, TF-IDF is sometimes not desirable for
vectorizing.

The behaviour of TF-IDF can be adjusted by giving words proportional importance, but why do
that when we can move to context-aware vectorizing with methods such as Word2Vec and GloVe?

(16)
Fig 15. TF IDF Implementation

Fig 16. TF IDF Results

In the case of TF-IDF, SVM showed the best results. A possible explanation is that SVM
performs well on small datasets, whereas the other algorithms, such as boosting and

(17)
ensemble learning algorithms (decision tree and random forest), are data hungry and perform
best on large datasets.
SVM works well when there is a clear margin of separation, and it uses only a subset of the
training points in the decision function (the support vectors), so it is also memory efficient,
as sketched below. We will talk about the models in more detail later.
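As an illustration of this pairing, a minimal TF-IDF plus linear SVM sketch with scikit-learn follows; the toy samples are assumptions, not the internship dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy feedback samples with 1 = positive, -1 = negative, 0 = neutral.
texts = ["loved these biscuits", "great crunchy taste",
         "worst charger ever", "stopped working in a week",
         "it is okay", "average product overall"]
labels = [1, 1, -1, -1, 0, 0]

# TF-IDF features feed a linear SVM, which handles sparse, high-dimensional
# inputs well even on small datasets.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["the biscuits tasted great", "charger broke instantly"]))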

5. Word Embeddings

Word embeddings are n-dimensional vectors (with n specified by the programmer) used to
represent words, another way of converting text into numerical data. The vectorizers discussed
above assign each word a single weight and are completely blind to context; word embeddings
take context into account, making the model better overall.

Fig 17. Word Embeddings

The most famous example of embedding arithmetic is:

Queen ≈ King – Man + Woman

(18)
There are 4 types of embeddings analysed on the pipeline:
1. Word2Vec
2. Glove
3. Bert
4. ELMo

BERT and ELMo are currently the best-performing embeddings and are available as pretrained
models, but for our purposes we will only use their embeddings.

6. Word2Vec & Glove Embeddings

Word2Vec is a semi-supervised learning technique, like many other neural networks: it is
supervised if you consider that the network has to learn through backpropagation, and
unsupervised if you consider that no human expert creates the labels. By contrast, KNN
(K-Nearest Neighbours) is a classification algorithm that determines the class of a point by
combining the classes of its K nearest points; it is supervised because a point is classified
based on the known classification of other points.

Fig 18. CBOW model

(19)
Word2Vec produces a vector space, typically of several hundred dimensions, containing each
unique word in the corpus, such that words that share common contexts in the corpus are located
close to one another in that space. This can be done using 2 different approaches: starting
from a single word to predict its context (Skip-gram) or starting from the context to predict a
word (Continuous Bag-of-Words).

Fig 19. Skip-gram model

Hence, the neural network is trained to predict the target word or the words around it; either
way, we only take the network weights as embeddings and do not need the full model.

Also, we prefer CBOW over Skip-gram here: firstly, CBOW trains faster and represents frequent
words better, mainly because it works from the context and maximizes the probability of the
target word given that context, which makes it easier to learn frequently occurring words than
rarely occurring ones; and secondly, CBOW gave better results in our analysis.
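A minimal sketch of training CBOW embeddings with Gensim and averaging them into sentence vectors follows; the parameter names assume Gensim 4.x, and the tiny corpus is only illustrative.

import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus; in the real pipeline these are the preprocessed reviews.
sentences = [["biscuits", "tasted", "great"],
             ["charger", "stopped", "working"],
             ["average", "product", "nothing", "special"]]

# sg=0 selects CBOW; vector_size is the embedding dimension (Gensim 4.x name).
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

def sentence_vector(tokens, model):
    # Average the word vectors of known tokens to get one sentence vector.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(sentence_vector(["biscuits", "tasted", "great"], w2v)[:5])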

(20)
Fig 20. Word2Vec results

As you can observe, the Word2Vec results are not as good as TF-IDF, mainly because TF-IDF also
takes care of rarely occurring words, while the Word2Vec CBOW model does not predict rare words
well.

Also, once Word2Vec has been trained, its vocabulary is fixed and it cannot simply be trained
further; whenever a new word comes in, Word2Vec has no learnt position for it in the
n-dimensional vector space, so unseen words are handled poorly.

Word2Vec relies only on the local context of words and proves suboptimal here; we will see how
other embeddings improve over Word2Vec in this case. It is also possible that, in our dataset,
word occurrence counts had a strong effect on the model, which is why TF-IDF gave good results
while Word2Vec could not.

(21)
Fig 21. GloVe Embeddings

Enter GloVe, the model which considers both the contextual and the statistical information of a
corpus. In other words, it tries to combine the count-vectorizer technique (via a co-occurrence
matrix) with Word2Vec's prediction-based technique (word embeddings).

Fig 22. GLoVe Results

(22)
And look at what we have here: better results as soon as word occurrence was taken into
account. So the reasoning was correct; the number of occurrences matters in this dataset and
GloVe improved over Word2Vec by using it. Even the overfitted results can be improved further
with some hyperparameter tuning and better vectorization.

Also, GloVe is already trained on a large corpus, so one can simply download the embeddings and
get a shorter training time compared to Word2Vec. This is great, but what if the embeddings did
not rely only on the neighbouring words?
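A minimal sketch of loading pre-trained GloVe vectors from a downloaded text file follows; the file name glove.6B.100d.txt is an assumption based on the standard Stanford release, and the averaging step mirrors what was done for Word2Vec.

import numpy as np

def load_glove(path):
    # Each line of the GloVe text file is: word followed by its vector components.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

# Assumed local copy of the 100-dimensional Stanford GloVe vectors.
glove = load_glove("glove.6B.100d.txt")

def sentence_vector(tokens, embeddings, dim=100):
    # Average the vectors of the tokens that exist in the GloVe vocabulary.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_vector(["great", "biscuits"], glove)[:5])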

7. BERT & ELMo Embeddings

Until now, with Word2Vec and GloVe embeddings, we had to obtain sentence embeddings from word
embeddings by summing and averaging them. This is a workable method, but not an ideal one, as
it does not really represent the sentence.

Fig 23. ELMo Embeddings

(23)
ELMo embeddings try to overcome the disadvantages of Word2Vec and GloVe by bringing contextual
information into the embeddings as well: the weighted sum of the raw word vector and 2
intermediate word vectors gives the resulting word vector.

Fig 24. ELMo Results

The results are still good; contextualized embeddings are better than isolated embeddings,
which do not grasp the context surrounding a word.

BERT, on the other hand, has its own way of doing things. BERT first adds special tokens: a SEP
token to separate sentences and a CLS token at the beginning of the text.

BERT also avoids the UNK (unknown) tag: rather than using it, it splits an unknown word into
smaller pieces and builds embeddings from those. For instance, if an embedding exists for the
word "embed" and the new word "embedding" arrives, it will create embeddings for the pieces
"embed" and "##ding", signalling that the word is similar to "embed" but may be used
differently.
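A minimal sketch of extracting sentence embeddings from a pre-trained BERT with the Transformers library follows; using the CLS token of bert-base-uncased is one common choice (an assumption here), and mean pooling over the tokens is an equally valid alternative.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bert_embedding(text):
    # The tokenizer adds [CLS] and [SEP] and splits unknown words into subwords.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the hidden state of the [CLS] token as the sentence embedding.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

vec = bert_embedding("The charger stopped working after two days.")
print(vec.shape)   # 768-dimensional vector for bert-base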

Fig 25. Bert Embeddings

(24)
That is why BERT is called state-of-the-art: it prepares itself for all the edge cases out
there.

Fig 26. BERT Results

The results are better than ever; in our case, BERT performed the best, with many models close
to the brim of perfection. Almost, because on closer observation there are still some models
with overfitting present.

During my internship I used to think that 90% training accuracy against 80% testing accuracy
(also known as validation accuracy) is what overfitting looks like, but on deeper observation
CatBoost is overfitting here as well, with 98% training accuracy and 91.5% testing accuracy.

Thus, we will now go deeper into all the algorithms and fine-tune them to prevent them from
overfitting.

(25)
8. ML Algorithms

Logistic Regression

Logistic Regression in our case did not show any overfitting, but if it had, L1 (Lasso), L2
(Ridge) and Elastic Net (Ridge and Lasso combined) are the penalties used for regularization:

• Lasso (L1): λ·|w|
• Ridge (L2): λ·w²
• Elastic Net (L1 + L2): λ₁·|w| + λ₂·w²

Not every penalty is compatible with every solver that Logistic Regression offers; for instance,
the LBFGS solver is only compatible with L2, not with L1 or Elastic Net, so this needs to be
kept in mind (see the sketch below). The difference between L1 and L2 is that one penalizes the
absolute value of the weights and the other their square.
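A minimal sketch of searching over these penalties with scikit-learn follows; the grid values and toy data are assumptions for illustration (note that elasticnet requires the saga solver and an l1_ratio).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy feature matrix and labels standing in for the vectorized reviews.
X = np.random.default_rng(0).random((60, 20))
y = np.array([-1, 0, 1] * 20)

# Each penalty is paired with a solver that supports it.
param_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1, 10]},
    {"solver": ["liblinear"], "penalty": ["l1"], "C": [0.1, 1, 10]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5], "C": [0.1, 1, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_)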

Fig 27. Learning Curve

(26)
Logistic Regression and SVM were also underperforming because of MultinomialNB: since
MultinomialNB only accepts zero or positive feature values, the pipeline scaled the training
data whenever MultinomialNB appeared in the models list.

Once the data was scaled to positive values exclusively for MultinomialNB and not for the other
models, the accuracy of both SVM and Logistic Regression rose.
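One way to implement this, sketched under the assumption that a min-max rescaling was used (the report does not name the exact scaler), is to wrap the scaling step into the MultinomialNB pipeline only, leaving the other models untouched.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy embedding-like features that contain negative values.
X = np.array([[-0.5, 1.2], [0.3, -0.7], [1.5, 0.2],
              [-1.0, 0.9], [0.8, -0.3], [-0.2, 0.4]])
y = [1, -1, 0, 1, -1, 0]

# MultinomialNB needs non-negative inputs, so it gets its own scaling step;
# SVM (and Logistic Regression) receive the features unchanged.
nb_model = make_pipeline(MinMaxScaler(), MultinomialNB())
svm_model = LinearSVC()

nb_model.fit(X, y)    # works: features rescaled to [0, 1] first
svm_model.fit(X, y)   # works directly on the raw, possibly negative features
print(nb_model.predict(X[:2]), svm_model.predict(X[:2]))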

After this, SVM did not require any hyperparameter tuning, and MultinomialNB, being purely
probability-based, did not have any parameters to tune.

Fig 28. SVC

Decision Tree and Random Forest always tend to overfit. This is because scikit-learn's default
tree depth is "None", so the tree naturally keeps splitting nodes until all the training samples
are segregated into pure leaves.

(27)
Fig 29. Pipeline Overview

Decision Tree has multiple parameters that can be tuned:

• Criterion: Gini or Entropy
• Max_depth: Change "None" to a fixed value and observe how it stops overfitting
• Min_samples_split: Much like the balancing criteria of an AVL tree, a node will split further
only if the number of training samples in it meets or exceeds the given threshold
• Min_samples_leaf: The minimum number of training samples that must be present for a leaf
node to exist

These were the parameters I tuned to get appropriate results; there are more that can be looked
into, and a minimal sketch of the effect of such tuning is shown below.
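This sketch uses noisy toy data and assumed parameter values only; it illustrates how constraining the tree shrinks the gap between training and test accuracy compared with the unconstrained default.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.choice([-1, 0, 1], size=300)   # noisy toy labels, easy to overfit

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default tree: max_depth=None keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: the tuned parameters limit how far splitting can go,
# so the gap between training and test accuracy narrows.
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                                 min_samples_split=10, min_samples_leaf=5,
                                 random_state=0).fit(X_tr, y_tr)

print("deep   :", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))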

(28)
Fig 30. Decision Tree

Fig 31. Random Forest

(29)
9. Boosting Algorithms

Boosting algorithms are said to take weak learners and combine them into a strong one, but what
is the difference between decision-tree algorithms and boosting algorithms?

Ensemble learning means joining multiple models to solve one problem. Bagging is a way to
decrease the variance of the prediction by generating additional training data from the
dataset, using combinations with repetitions to produce multi-sets of the original data.
Boosting is an iterative technique which adjusts the weight of an observation based on the last
classification: if an observation was classified incorrectly, it tries to increase the weight
of that observation.

Random Forest comes under bagging, while XGBoost, CatBoost and others come under boosting
algorithms.

Fig 32. Boosting Algorithms

Even though boosting algorithms use trees, they build them differently from a plain Decision
Tree: they still rely on decision-based trees, but add careful optimisation of the loss
function, whereas Random Forest takes random subsets of the data, fits trees on them and
aggregates their predictions.

(30)
Fig 33. XG Boosting

Moving ahead, I used Gradient Boosting, XGBoost and CatBoost in my pipeline. Gradient Boosting
worked better and overfit less when n_estimators (the number of sequential trees to be
modelled) was reduced from 100 to 50.

XGBoost was optimized by adding regularization and tuning its learning rate. Unlike plain
decision trees, XGBoost has regularization parameters (in xgb terms: lambda = L2
regularization, alpha = L1 regularization, learning_rate a.k.a. eta).

CatBoost, too, was optimised by adding L2 regularization and tuning the number of iterations
(the number of trees in the model), as sketched below.
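A minimal sketch of these regularization parameters follows; the values and toy data are assumptions for illustration, and it requires the xgboost and catboost packages to be installed.

import numpy as np
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.random((120, 10))
y = np.array([0, 1, 2] * 40)   # XGBoost's sklearn API expects labels starting at 0

# learning_rate (eta) shrinks each tree's contribution; reg_lambda and reg_alpha
# are the L2 and L1 penalties mentioned above.
xgb = XGBClassifier(n_estimators=50, learning_rate=0.1,
                    reg_lambda=1.0, reg_alpha=0.5)
xgb.fit(X, y)

# CatBoost: l2_leaf_reg is its L2 penalty, iterations the number of trees.
cat = CatBoostClassifier(iterations=50, learning_rate=0.1,
                         l2_leaf_reg=3.0, verbose=False)
cat.fit(X, y)

print(xgb.score(X, y), cat.score(X, y))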

(31)
Chapter 4. CONCLUSION
1. Conclusion

Fig 34. Optimised Results

Hence, we learnt how to apply supervised ML algorithms: take a raw dataset as input and return
the best hyperparameter-tuned model built on good embeddings.

2. Future Work

In the next part, an entire segment of deep learning is still left to apply and explore, which I
am looking forward to. There is still a lot of room for optimising these models and improving
the embeddings. BERT embeddings are a black box: even though we know how the model works, we
are not sure how the weights of the neural network relate to each other.

A lot of research is going into this area and new techniques keep appearing, so those can be
looked into as well.

(32)
Chapter 5. REFERENCES

1. https://scikit-learn.org/stable/index.html
2. https://www.geeksforgeeks.org/
3. https://medium.com/
4. https://towardsdatascience.com/
5. https://stats.stackexchange.com/
6. https://stackoverflow.com/

(33)
