
CSE 447 Project

Joseph Schafer, Paul Druta, Spencer Lu


Approach:
For our classifier, we used a combination of two models: a recurrent neural network and a bi-
character probabilistic model. To select our top 3 characters, we first process the input through a
recurrent neural network with 1 RNN layer, 64-dimensional embeddings, a 128-dimensional hidden
layer, a batch size of 256, and a learning rate of 0.001, obtaining the 50 most probable next characters.
Once these candidates are determined, we select the top 3 among them based on our bi-character
probabilities, conditioned on the single most recent character of the input. The rationale behind this design
decision is that the bigram model is able to learn common pairs of tokens, while the RNN model is able to
account for longer context. Both of these models are based on the language modeling lecture and the
corresponding bigram and RNN assignments in A2 and A3 [3].
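
To make this pipeline concrete, here is a minimal sketch (in PyTorch) of how the two stages could fit together. The names CharRNN, predict_top3, char_to_idx, idx_to_char, and bigram_probs are illustrative assumptions rather than our actual code; only the hyperparameters (1 RNN layer, 64-dimensional embeddings, a 128-dimensional hidden layer, 50 candidates, 3 final predictions) come from the description above.

import torch
import torch.nn as nn

class CharRNN(nn.Module):
    # Character-level RNN with the hyperparameters stated above:
    # 1 RNN layer, 64-dim embeddings, 128-dim hidden state.
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) tensor of character indices.
        emb = self.embed(x)
        output, _ = self.rnn(emb)
        # Score the next character from the final hidden state.
        return self.out(output[:, -1, :])  # (batch, vocab_size)

def predict_top3(model, history, char_to_idx, idx_to_char, bigram_probs):
    # Stage 1: the RNN proposes the 50 most probable next characters.
    model.eval()
    with torch.no_grad():
        x = torch.tensor([[char_to_idx[c] for c in history]])
        logits = model(x).squeeze(0)
        top50 = torch.topk(logits, k=50).indices.tolist()
    # Stage 2: re-rank those 50 candidates by the bigram probability
    # P(candidate | most recent input character) and keep the top 3.
    prev = history[-1]
    candidates = [idx_to_char[i] for i in top50]
    candidates.sort(key=lambda c: bigram_probs.get((prev, c), 0.0), reverse=True)
    return candidates[:3]

Restricting the bigram re-ranking to the RNN's top 50 keeps the lookup cheap while still letting local pair statistics decide among candidates the RNN finds roughly equally plausible.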
Dataset:
To train our models, we downloaded a subset of Wikipedia data from a selection of 10 languages: German,
English, Spanish, French, Italian, Japanese, Polish, Portuguese, Russian, and Mandarin Chinese [4]. We
selected these languages because they are the 10 featured languages on the homepage of Wikipedia, implying
that their Wikipedias are considered to be of comparatively high quality. Additionally, the
combination of these languages allows for some cross-language learning (e.g., the large degree of
similarity among French, Spanish, Italian, and Portuguese, or between German and English), while
also letting our model learn something about languages that treat characters very differently, like
Mandarin Chinese.
Since the download was in a format containing the entire article along with its links and other
markup syntax unique to Wikipedia, we then pre-processed this data by, among other things: removing
file links and links to other pages; removing the symbols surrounding headers; removing indicators for lists
of categories and tables of contents; and removing XML markup and comments.
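
As an illustration of this cleanup, the sketch below shows the kind of regex-based filtering described; the report does not give our exact patterns, so these are simplified stand-ins (they do not handle nested links, for instance).

import re

def clean_wiki_text(text):
    # Remove file/image links and category indicators, e.g. [[File:...]].
    text = re.sub(r"\[\[(?:File|Image|Category):[^\]]*\]\]", "", text)
    # Reduce inter-page links [[target|label]] or [[target]] to their label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Strip the = symbols surrounding section headers, keeping the title.
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    # Drop XML comments and any remaining markup tags.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)
    return text.strip()
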
For input into our recurrent neural network, we used one sample per paragraph processed from the
Wikipedia files, while for our bi-character model we considered every bi-character pair in these samples
(every pair within the input, as well as the pair formed by the last character of the input and the output).
For the final checkpoint, we also trimmed down our data size significantly in order to remove a few more
Wikipedia-specific artifacts. Our initial language distribution of samples (across train, dev, and test) is
depicted in chart 1; while the raw numbers changed in this final filtration step, we observed no significant
changes to the distribution. Our final training data consisted of 19,013,053 samples for our recurrent
neural network and 780,322,237 corresponding training samples for our bi-character model.
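
The sketch below shows how bi-character training counts could be tallied under this scheme; the function names are our own assumptions, and smoothing and normalization details are omitted. Each sample is assumed to be an (input paragraph, next character) pair.

from collections import Counter, defaultdict

def collect_bigram_counts(samples):
    # samples: iterable of (input_text, next_char) pairs.
    counts = Counter()
    for text, next_char in samples:
        # Every adjacent pair within the input...
        for a, b in zip(text, text[1:]):
            counts[(a, b)] += 1
        # ...plus the last input character paired with the output.
        if text:
            counts[(text[-1], next_char)] += 1
    return counts

def to_probabilities(counts):
    # Convert raw counts into conditional probabilities P(b | a).
    totals = defaultdict(int)
    for (a, _), n in counts.items():
        totals[a] += n
    return {(a, b): n / totals[a] for (a, b), n in counts.items()}
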
References/Used work:
To build our RNN, we relied heavily on an adapted version of the sentiment classifier developed
by the 447 TAs and demonstrated in the week 4 section [5]. Additionally, our code relies on the
collections, os, random, xml, torch, tqdm, pickle, string, and argparse libraries. All of these
except tqdm and torch (the PyTorch library) are part of the Python standard library.
PyTorch was developed by Adam Paszke et al., while tqdm was developed by Casper da Costa-Luis and
Stephen Karl Larroque [2, 1]. We also used the starter code from the given GitHub project, also developed
by the 447 TAs [6].
References:

[1] Casper da Costa-Luis and Stephen Karl Larroque. 2021. tqdm. https://github.com/tqdm/tqdm. (2021).

[2] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang,
Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie
Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning
Library. In NeurIPS. 8024-8035.

[3] Noah Smith. 2021. Natural Language Processing (CSE 517 & 447): (Neural) Language Models.
https://drive.google.com/file/d/15xk-qyd3DFBLBYlTBDegfuZJKElJxuk4/view. (2021).

[4] Wikimedia Foundation. February 1, 2021. Wikimedia Downloads.
https://dumps.wikimedia.org/backup-index.html. (February 8, 2021).

[5] Zhaofeng Wu. 2021. 03_sentiment_classification_example.ipynb.
https://colab.research.google.com/drive/14GAMb7c6FbDnhWvqcliCZ8KYNvqdnQz7?usp=sharing. (2021).

[6] Victor Zhong. 2021. cse447-project. https://github.com/vzhong/cse447-project. (2021).
