
CSE 447 Project

Joseph Schafer, Paul Druta, Spencer Lu


Approach:
For our classifier, we used a combination of two models: a recurrent neural network and a bi-
character probabilistic model. To select our top 3 characters, we first process the input through a
recurrent neural network with 1 RNN layer, 64-dimensional embeddings, a 128-dimensional hidden
layer, a batch size of 256, and a learning rate of 0.001, obtaining the 50 most probable next characters.
Once these candidates are determined, we select the top 3 among them based on our bi-character
probabilities, conditioned on the single most recent character of the input. The rationale behind this design
decision is that the bigram model is able to learn common pairs of tokens, while the RNN model is able to
account for longer context. Both of these models are based on the language modeling lecture and the
corresponding bigram and RNN assignments in A2 and A3 [3].
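
To make this pipeline concrete, here is a minimal sketch (in PyTorch) of how the two stages could fit together. The names CharRNN, predict_top3, char_to_idx, idx_to_char, and bigram_probs are illustrative assumptions rather than our actual code; only the hyperparameters (1 RNN layer, 64-dimensional embeddings, a 128-dimensional hidden layer, 50 candidates, 3 final predictions) come from the description above.

import torch
import torch.nn as nn

class CharRNN(nn.Module):
    # Character-level RNN with the hyperparameters stated above:
    # 1 RNN layer, 64-dim embeddings, 128-dim hidden state.
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) tensor of character indices.
        emb = self.embed(x)
        output, _ = self.rnn(emb)
        # Score the next character from the final hidden state.
        return self.out(output[:, -1, :])  # (batch, vocab_size)

def predict_top3(model, history, char_to_idx, idx_to_char, bigram_probs):
    # Stage 1: the RNN proposes the 50 most probable next characters.
    model.eval()
    with torch.no_grad():
        x = torch.tensor([[char_to_idx[c] for c in history]])
        logits = model(x).squeeze(0)
        top50 = torch.topk(logits, k=50).indices.tolist()
    # Stage 2: re-rank those 50 candidates by the bigram probability
    # P(candidate | most recent input character) and keep the top 3.
    prev = history[-1]
    candidates = [idx_to_char[i] for i in top50]
    candidates.sort(key=lambda c: bigram_probs.get((prev, c), 0.0), reverse=True)
    return candidates[:3]

Restricting the bigram re-ranking to the RNN's top 50 keeps the lookup cheap while still letting local pair statistics decide among candidates the RNN finds roughly equally plausible.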
Dataset:
To train our models, we downloaded a subset of Wikipedia data from a selection of 10 languages: German,
English, Spanish, French, Italian, Japanese, Polish, Portuguese, Russian, and Mandarin Chinese [4]. We
selected these languages because they are the 10 featured languages on the homepage of Wikipedia, implying
that their Wikipedias are considered to be of comparatively high quality. Additionally, the
combination of these languages allows for some cross-language learning (e.g., the large degree of
similarity among French, Spanish, Italian, and Portuguese, or between German and English), while
also letting our model learn something about languages that treat characters very differently, like
Mandarin Chinese.
Since the download was in a format containing the entire article along with its links and other
markup syntax unique to Wikipedia, we then pre-processed this data by, among other things: removing
file links and links to other pages; removing the symbols surrounding headers; removing indicators for lists
of categories and tables of contents; and removing XML markup and comments.
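
As an illustration of this cleanup, the sketch below shows the kind of regex-based filtering described; the report does not give our exact patterns, so these are simplified stand-ins (they do not handle nested links, for instance).

import re

def clean_wiki_text(text):
    # Remove file/image links and category indicators, e.g. [[File:...]].
    text = re.sub(r"\[\[(?:File|Image|Category):[^\]]*\]\]", "", text)
    # Reduce inter-page links [[target|label]] or [[target]] to their label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Strip the = symbols surrounding section headers, keeping the title.
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    # Drop XML comments and any remaining markup tags.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)
    return text.strip()
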
For input into our recurrent neural network, we used one sample per paragraph processed from the
Wikipedia files, while for our bi-character model we considered every bi-character pair in these samples
(every pair within the input, as well as the pair formed by the last character of the input and the output).
For the final checkpoint, we also trimmed down our data size significantly in order to remove a few more
Wikipedia-specific artifacts. Our initial language distribution of samples (across train, dev, and test) is
depicted in chart 1; while the raw numbers changed in this final filtration step, we observed no significant
changes to the distribution. Our final training data consisted of 19,013,053 samples for our recurrent
neural network and 780,322,237 corresponding training samples for our bi-character model.
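
The sketch below shows how bi-character training counts could be tallied under this scheme; the function names are our own assumptions, and smoothing and normalization details are omitted. Each sample is assumed to be an (input paragraph, next character) pair.

from collections import Counter, defaultdict

def collect_bigram_counts(samples):
    # samples: iterable of (input_text, next_char) pairs.
    counts = Counter()
    for text, next_char in samples:
        # Every adjacent pair within the input...
        for a, b in zip(text, text[1:]):
            counts[(a, b)] += 1
        # ...plus the last input character paired with the output.
        if text:
            counts[(text[-1], next_char)] += 1
    return counts

def to_probabilities(counts):
    # Convert raw counts into conditional probabilities P(b | a).
    totals = defaultdict(int)
    for (a, _), n in counts.items():
        totals[a] += n
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}
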
References/Used work:
To build our RNN, we relied heavily on an adapted version of the sentiment classifier developed
by the 447 TAs and demonstrated in the week 4 section [5]. Additionally, our code relies on the
collections, os, random, xml, torch, tqdm, pickle, string, and argparse libraries. All of these
except tqdm and torch (the PyTorch library) are part of the Python standard library.
PyTorch was developed by Adam Paszke et al., while tqdm was developed by Casper da Costa-Luis and
Stephen Karl Larroque [2, 1]. We also used the starter code from the given GitHub project, also developed
by the 447 TAs [6].
References:

[1] Casper da Costa-Luis and Stephen Karl Larroque. 2021. tqdm. https://github.com/tqdm/tqdm. (2021).

[2] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang,
Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie
Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning
Library. In NeurIPS. 8024-8035.

[3] Noah Smith. 2021. Natural Language Processing (CSE 517 & 447): (Neural) Language Models.
https://drive.google.com/file/d/15xk-qyd3DFBLBYlTBDegfuZJKElJxuk4/view. (2021).

[4] Wikimedia Foundation. February 1, 2021. Wikimedia Downloads.
https://dumps.wikimedia.org/backup-index.html. (February 8, 2021).

[5] Zhaofeng Wu. 2021. 03_sentiment_classification_example.ipynb.
https://colab.research.google.com/drive/14GAMb7c6FbDnhWvqcliCZ8KYNvqdnQz7?usp=sharing. (2021).

[6] Victor Zhong. 2021. cse447-project. https://github.com/vzhong/cse447-project. (2021).
