Text Classification With Deep Learning - Code

The document discusses text classification in Natural Language Processing, focusing on authorship attribution using Deep Learning techniques. It outlines the approach of building a linguistic style model to extract author-specific features and a classifier to determine the authors of disputed texts, specifically the Federalist Papers. The methodology includes preprocessing text data, using LSTM for feature extraction, and a feed-forward network for classification.

Uploaded by

ashish99998

Text Classification with Deep Learning 12/11/19, 1:24 AM

(https://www.nvidia.com/dli/)

http://ec2-18-222-127-142.us-east-2.compute.amazonaws.com/VjVvup…task2/task/Text%20Classification%20with%20Deep%20Learning.ipynb# Page 1 of 24

Introduction
Text classification is a classic problem in Natural Language Processing. Given multiple
individual spans of texts (sentences, paragraphs, documents, etc.), the task is to assign each
span one or multiple labels - or classes - out of k possible ones. Some possible applications
of text classification are:

Genre identification - Does this text contain news, sports, finance, etc?
Language detection - Is this text in English, German or any other language?
Sentiment analysis - What type of sentiment (positive/negative/neutral) is present in this
text? Additionally, if multiple subjects or topics are discussed, what sentiment is
associated with each such subject/topic?

A particular type of text classification is the problem of authorship attribution. In this task we
are given some documents and a set of possible authors. We then assign to each document
the author (or authors) we believe wrote it. Presumably, we also have a
large set of documents for which we know the authors already so we can extract features and
characteristics to help us with the unknown ones.

Authorship attribution is a well-studied problem which led to the field of Stylometry
(https://en.wikipedia.org/wiki/Stylometry). As with many other NLP problems, it has benefited
greatly from the increase in available computer power, data and advanced machine learning
techniques. All of these make authorship attribution a natural candidate for the use of deep
learning (DL). In particular, we can benefit from DL's ability to automatically extract the
relevant features for a specific problem.

In this lab we will focus on the following:

1. Building a linguistic style model to extract author-specific features from a set of texts
(known as a corpus)
2. Using these features for building a classification model for authorship attribution
3. Applying the model for identifying the author of a set of unknown documents

The Federalist Papers

The Federalist Papers (https://en.wikipedia.org/wiki/The_Federalist_Papers)[1] are a set of
essays written between 1787 and 1788 by Alexander Hamilton
(https://en.wikipedia.org/wiki/Alexander_Hamilton), James Madison
(https://en.wikipedia.org/wiki/James_Madison) and John Jay
(https://en.wikipedia.org/wiki/John_Jay). Initially published under the pseudonym 'Publius',
their intent was to encourage the ratification of the then-new Constitution of the United States
of America. In later years, a list emerged where the author of each one of the 85 papers was
identified. Nevertheless, for a subset of these papers the author is still in question. The
problem of the Federalist Papers authorship attribution has been a subject of much research
in statistical NLP in the past (see the above Wikipedia article for details). We will try to use
Deep Learning to re-create this research.

In concrete terms, the problem is identifying - for each one of the disputed papers - whether
Alexander Hamilton (AH) or James Madison (JM) is the author. We will assume that each
paper has a single author (i.e., that no collaboration took place) and that each author has a
well-defined writing style that is displayed across all the papers.

Approach
We will take the following approach with this problem:

Use the non-disputed documents as labeled data for an end-to-end model. The model is
composed of two distinct parts (see Figure 1 below):

1. A linguistic style feature extractor

2. A classifier

Use the model to determine the author for each disputed paper

Figure 1: Author Attribution Model

Let us now examine each model component in turn.

Style Extraction

Given a sequence of tokens (subsets of the input text), we would like the extractor to return a
representation of this sequence such that sequences with similar stylistic qualities have
similar representations. In other words, we would like to find a mapping from the sequence to
a vector space that uses style properties as its basis. Some possible properties that the
extractor may learn include (but are not limited to):

Where, when and how often punctuation is used

Distribution of sentence length
The use of specific vocabulary and syntactic constructions

The above features are well-known in stylometry. We would like, however, for the model to
learn both these and other features that may be applicable.

In order to use sequences we will need models that can deal with time-varying inputs (i.e.,
multiple timesteps with different inputs at each timestep). In our case, we will have one token
at every timestep and a fixed sequence length. The sequence length is one of our
hyperparameters - parameters that we determine empirically and that do not change during
training. The model itself will be a Recurrent Neural Net (RNN) and specifically, a variant named
Long Short-Term Memory (LSTM) (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
that is well suited for NLP problems (see Figure 2).

Figure 2: LSTM Cell
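
To make the recurrence concrete, here is a minimal sketch of a single LSTM timestep for a cell with one unit, written in plain Python. It uses the standard LSTM gate equations and is not taken from this lab's code; the function name lstm_step, the gate names and all weight values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One timestep of a scalar (single-unit) LSTM cell.

    params maps each gate name to a (w_x, w_h, bias) triple.
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))         # how much new information to write
    f = sigmoid(pre_activation("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))        # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))   # candidate value to write

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights, for illustration only (a real model learns them)
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.4, 0.2, 1.0),
    "output": (0.3, 0.3, 0.0),
    "candidate": (0.8, -0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:   # one (already numeric) input per timestep
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The hidden state h after the last timestep plays the role of the fixed-size sequence representation discussed in this section; in a real layer x, h and c are vectors and the weights are matrices.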

Ideally, our feature extractor will be a type of language model. That is, given a set of tokens
the model can predict the next token with high accuracy. Let's look at an example:

(1) The cat sat on the X

Looking at (1), we can make some assumptions about the token X:

It is likely a noun (or more precisely, a Noun Phrase
(https://en.wikipedia.org/wiki/Noun_phrase)). Otherwise, (1) may not be a grammatically
correct sentence.
It is likely some piece of furniture. Otherwise, there would not be agreement with the
semantics of the verb sat.

If we have such a model available - namely, one that predicts sequences that appear in our
training data with high probability - the natural conclusion is that it has learned how our
authors tend to write and we can then attempt to use that knowledge. Here we encounter our
primary issue: training language models on words requires potentially millions of examples.
We have a relatively small corpus, so a word language model may severely overfit (in other
words, this model may pick up specific patterns that just happen to appear in our corpus).
This is a tough issue - fortunately, in this case it is easy to overcome. Since our corpus is
composed of proper English text (i.e., no foreign characters or emojis) we can simply use
characters as the tokens and not words. For example, given the phrase "to the people of New
York", we will have the sequence: ['t', 'o', ' ', 't', 'h', 'e', ' ', 'p',...] Note that whitespace and
punctuation are also characters and will be part of the sequences. This will hopefully assist us
in learning the features mentioned above.
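
As a small illustration of the character-level tokenization described above, the sketch below splits the example phrase into characters and maps each distinct character to an integer. It uses plain Python rather than the tokenizer used later in the lab, and the index assignment (sorted characters, starting at 1) is an arbitrary choice made for this example.

```python
phrase = "to the people of New York"

# Character-level tokens: every character, including each space, is one token.
# The lab lower-cases its text, so we do the same here.
tokens = list(phrase.lower())
print(tokens[:8])    # ['t', 'o', ' ', 't', 'h', 'e', ' ', 'p']

# Map each distinct character to an integer index, starting at 1
# (an arbitrary convention for this sketch).
vocab = {ch: idx for idx, ch in enumerate(sorted(set(tokens)), start=1)}
encoded = [vocab[ch] for ch in tokens]

print(len(tokens))   # 25
print(len(vocab))    # 13 distinct characters in this phrase
print(encoded[:8])
```

Note how small the vocabulary is compared to a word-level one: even a long text only ever contains a few dozen distinct characters.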

Another trick is to use character embeddings rather than the raw characters
themselves (for instance, in a one-hot encoded representation). Recall that embeddings are
dense representations of data that can learn the semantics of the domain. Here the
terminology may be a bit confusing, but the final result is that the embedding of a single
character can represent features about the context of the character: which characters tend to
appear before and/or after this one. This is precisely what we need.
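
The difference between the two representations can be sketched in a few lines of plain Python. The vocabulary, the embedding size and the random initial values below are all hypothetical; in the real model the embedding table is a trainable variable whose values are learned together with the rest of the network.

```python
import random

vocab = {"a": 0, "b": 1, "c": 2, " ": 3}   # hypothetical 4-character vocabulary
EMBEDDING_SIZE = 3                          # hypothetical size, kept tiny for readability

def one_hot(index, size):
    """Sparse representation: all zeros except a single 1."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# An embedding table holds one dense row per character.
# Here it is only randomly initialised; training would adjust the values.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_SIZE)]
                   for _ in range(len(vocab))]

def embed(index):
    """Embedding lookup is simply row selection from the table."""
    return embedding_table[index]

def matvec(vector, table):
    """Multiply a row vector by the table (vector @ table)."""
    return [sum(v * row[j] for v, row in zip(vector, table))
            for j in range(len(table[0]))]

idx = vocab["b"]
print(one_hot(idx, len(vocab)))   # [0.0, 1.0, 0.0, 0.0]
print(embed(idx))                 # a dense vector of EMBEDDING_SIZE values

# Selecting a row gives the same result as multiplying the one-hot vector by the table.
assert matvec(one_hot(idx, len(vocab)), embedding_table) == embed(idx)
```

The one-hot vector grows with the vocabulary and carries no notion of similarity, whereas the embedding has a fixed, chosen size and its values can move during training so that characters used in similar contexts end up with similar vectors.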

So the structure of our style feature encoder is as follows:

A sequence of characters (length of sequence is a hyperparameter)

An embedding layer for characters (dimensionality of the embedding is a hyperparameter)
LSTM layer for learning a sequence representation (dimensionality of LSTM output is a
hyperparameter). We will use the output of the LSTM at the end of the sequence as our
output. This is known as a context vector (or state vector) - it encodes details about a
particular sequence of characters that can be used by the classifier.

Classifier
As noted above, the output of the style feature extractor (or feature encoder) is a fixed-size
vector. We use this vector as the input to a simple multi-layer feed-forward network. Each
layer in the network can then extract features that help it determine whether the character
sequence (that is, the fixed-size representation) was written by AH or JM. The final layer is
composed of a single neuron with a sigmoid (https://en.wikipedia.org/wiki/Sigmoid_function)
activation, so that the entire model's output y is constrained to be 0 ≤ y ≤ 1. Hence, the
output can be interpreted as the probability that the label is 1 or equivalently - given that we
arbitrarily assigned the value 1 to JM and 0 to AH - that the sequence was written by JM. This
is easily extendable to multiple (k > 2) authors by using k neurons in the final layer with a
softmax (https://en.wikipedia.org/wiki/Softmax_function) activation.
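
The two output activations mentioned above can be sketched as follows. This is a plain-Python illustration of the standard sigmoid and softmax formulas, not the lab's implementation; the pre-activation values are hypothetical.

```python
import math

AH, JM = 0, 1   # label convention used in this lab

def sigmoid(z):
    """Maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Maps k real numbers to k positive values that sum to 1."""
    m = max(zs)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Two authors: a single output neuron
p_jm = sigmoid(0.8)                        # hypothetical pre-activation value
predicted = JM if p_jm >= 0.5 else AH
print(p_jm, predicted)

# k = 3 authors: one output neuron per author
probs = softmax([2.0, 1.0, 0.1])           # hypothetical pre-activation values
print(probs, probs.index(max(probs)))
```

With two authors, thresholding the sigmoid output at 0.5 gives the predicted label; with k authors, the index of the largest softmax value does.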

A very useful property of the entire model (both the feature encoder and the classifier) is that it
can be trained end-to-end - there is no need for specifically building either the feature
encoder or classifier. Furthermore, it also means that we can use this model to easily infer the
author of each of the disputed papers. For each such document, we perform the following
procedure:

1. Break the entire document into sequences of the same length, as determined by the
hyperparameter
2. Retrieve an author prediction for each one of these sequences
3. Determine which author has received more 'votes'. We will then use this author as our
prediction for the entire document. (Note: in order to have a clear majority, we need to
ensure that the number of sequences is odd).
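
The voting procedure above can be sketched as follows. This is an illustration under stated assumptions, not the lab's inference code: the document is cut into consecutive, non-overlapping sequences, the last sequence is dropped when their number is even, and predict_fn is a hypothetical stand-in for the trained model that returns the probability that JM wrote a sequence.

```python
from collections import Counter

AH, JM = 0, 1   # same label convention as the lab

def split_into_sequences(text, seq_len):
    """Cut text into consecutive, non-overlapping chunks of exactly seq_len characters."""
    return [text[i:i + seq_len] for i in range(0, len(text) - seq_len + 1, seq_len)]

def attribute_document(text, predict_fn, seq_len=30):
    """Return (winning author, vote counts) for one document.

    predict_fn is a hypothetical callable that maps one sequence to the
    model's output, i.e. the estimated probability that JM wrote it.
    """
    sequences = split_into_sequences(text, seq_len)
    if len(sequences) % 2 == 0:
        sequences = sequences[:-1]          # keep an odd number of votes to avoid ties
    if not sequences:
        raise ValueError("document is too short for the chosen sequence length")
    votes = Counter(JM if predict_fn(seq) >= 0.5 else AH for seq in sequences)
    return votes.most_common(1)[0][0], votes

# Toy stand-in for the trained model, for demonstration only
def toy_predict(seq):
    return 0.9 if "a" in seq else 0.1

author, votes = attribute_document("a" * 90 + "b" * 60, toy_predict)
print(author, dict(votes))
```

In the toy run the document yields five sequences, three of which are attributed to JM, so JM wins the vote.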

Building the Model

We begin by setting up the data pre-processing pipeline. For each one of the authors, we
aggregate all the known papers into a single long text. Recall that we assume that style does
not change across the various papers, hence a single text is equivalent to multiple small ones
yet it is much easier to deal with programmatically.

For each paper of each author we perform the following steps:

Convert all text to lower-case (ignoring the fact that capitalization may be a stylistic
property)
Convert all newlines and runs of whitespace into single spaces
Remove any mention of the authors' names, otherwise we risk data leakage
(https://machinelearningmastery.com/data-leakage-machine-learning/)

In [17]:
import numpy as np
import os
from sklearn.model_selection import train_test_split

# Constants for Hamilton/Madison
AH = 0
JM = 1
UNKNOWN = -1

def preprocess_text(file_path):
    """ Read and preprocess the text from a specific file.
    Preprocessing includes:
    * Replace newlines by spaces
    * Replace double spaces by single spaces
    * Lower-cases the text
    * Removes the names of the authors

    # Arguments
        file_path: the path to read the file from

    # Returns
        The preprocessed file
    """
    with open(file_path, 'r') as f:
        lines = f.readlines()
        # NOTE: the end of the next statement is cut off in the source. The lower-casing
        # and the removal of the author names are reconstructed from the docstring above.
        text = ' '.join(lines[1:]).replace("\n", ' ').replace('  ', ' ').lower()
        text = text.replace('hamilton', '').replace('madison', '')
        text = ' '.join(text.split())
    return text

# Concatenate all the papers known to be written by Hamilton/Madison into a single long text
all_hamilton, all_madison = '', ''
for x in os.listdir('./federalist_papers/AH/'):
    all_hamilton += preprocess_text('./federalist_papers/AH/' + x)

for x in os.listdir('./federalist_papers/JM/'):
    all_madison += preprocess_text('./federalist_papers/JM/' + x)

# Print lengths of the large texts
print("Hamilton text length: {}".format(len(all_hamilton)))
print("Madison text length: {}".format(len(all_madison)))

Hamilton text length: 672168

Madison text length: 271345

Note that there is much more text available for AH than JM. We will need to address this issue
in order to not bias the model towards AH.

The next step is to break the long text for each author into many small sequences. As
described above, we empirically choose a length for the sequence and use it throughout the
model's lifecycle. We get our full dataset by labeling each sequence with its author.
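
As a toy illustration of this step, the sketch below cuts a short sequence into every contiguous window of a given length and attaches the same label to each window. The helper name make_windows and the example values are hypothetical. A stride of one position is used, which is consistent with the sequence counts reported later in this notebook (for example 271345 - 30 + 1 = 271316 for JM).

```python
def make_windows(sequence, window_length):
    """Return every contiguous window of window_length items (stride 1)."""
    return [sequence[i:i + window_length]
            for i in range(len(sequence) - window_length + 1)]

JM = 1                                   # label convention used in this lab
encoded = [5, 2, 7, 1, 3, 3]             # hypothetical character indices
windows = make_windows(encoded, 4)

print(windows)        # [[5, 2, 7, 1], [2, 7, 1, 3], [7, 1, 3, 3]]
print(len(windows))   # 3, i.e. len(encoded) - window_length + 1

# Every window inherits the label of the author who wrote the long text
labelled = [(window, JM) for window in windows]
print(labelled[0])    # ([5, 2, 7, 1], 1)
```

Because consecutive windows overlap in all but one position, a text of n characters yields almost n training examples, which is how a relatively small corpus produces hundreds of thousands of sequences.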

To break the long texts into smaller sequences we use the Tokenizer class from the Keras
framework. In particular, note that we set it up to tokenize according to characters and not
words.

In [18]:
from keras.preprocessing.text import Tokenizer

# Hyperparameter - sequence length to use for the model
SEQ_LEN = 30

def make_subsequences(long_sequence, label, sequence_length=SEQ_LEN):
    """ Breaks a large sequence into multiple smaller sequences of specified length.

    # Arguments
        long_sequence: the long sequence to break into smaller sequences
        label: the label to assign to each subsequence
        sequence_length: the length of each subsequence

    # Returns
        X: matrix of size [len_sequences - sequence_length + 1, sequence_length]
        y: matrix of size [len_sequences - sequence_length + 1, 1] with the label of each subsequence
    """
    len_sequences = len(long_sequence)
    X = np.zeros(((len_sequences - sequence_length)+1, sequence_length))
    y = np.zeros((X.shape[0], 1))
    for i in range(X.shape[0]):
        X[i] = long_sequence[i:i+sequence_length]
        y[i] = label
    return X,y

# We use the Tokenizer class from Keras to convert the long texts into a sequence of character indices
tokenizer = Tokenizer(char_level=True)

# Make sure to fit all characters in texts from both authors
tokenizer.fit_on_texts(all_hamilton + all_madison)

madison_long_sequence = tokenizer.texts_to_sequences([all_madison])[0]
hamilton_long_sequence = tokenizer.texts_to_sequences([all_hamilton])[0]

# Convert the long sequences into sequence and label pairs
X_madison, y_madison = make_subsequences(madison_long_sequence, JM)
X_hamilton, y_hamilton = make_subsequences(hamilton_long_sequence, AH)

# Print sizes of available data
print("Number of characters: {}".format(len(tokenizer.word_index)))
print('Madison sequences: {}'.format(X_madison.shape))
print('Hamilton sequences: {}'.format(X_hamilton.shape))

Number of characters: 53
Madison sequences: (271316, 30)
Hamilton sequences: (672139, 30)

Compare the number of raw characters to the number of labeled sequences for each author.
Deep Learning requires many examples of each input. The following code calculates the
number of total and unique words in the texts.

In [19]:
# Calculate the number of unique words in the text
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts([all_madison, all_hamilton])

print("Total word count: ", len((all_madison + ' ' + all_hamilton).split()))
print("Total number of unique words: ", len(word_tokenizer.word_index))

Total word count: 158673

Total number of unique words: 8250

Exercise: Do you think a word or a character embedding model is appropriate here? Write
down your reasoning in the following cell.

We now proceed to create our train, validation and test sets.

We begin by addressing the discrepancy in the amounts of data available for AH vs. JM. We
choose a simple solution here by choosing the same number of sequences for AH as are
available for JM and discarding the rest. Depending on the performance of the model, this
may or may not be a good idea in general.

We then create the various datasets:

The training set is used by the model to learn the weights in the neural network. The
model will iterate over this data many times, until performance is deemed to be
acceptable. A single pass through all the data is known as an epoch. Each training loop
works on a subset of the data known as a mini-batch. The number of instances in this
mini-batch is known as the batch size.
The validation set is used at the end of each epoch to assess performance of the model.
We present the model with data it has not seen before in order to evaluate its ability to
generalize. Had we used the training set instead, the model would have no 'motivation' to
learn the internal structure of the data - it would just try to 'memorize' the original data.
We stop the training when the validation set performance begins to drop, as this means
that the model now specializes (i.e., overfits) on the training set and is losing its ability to
deal with unseen data.
The test set is the final measure of performance that we report for the model. Once again,
we feed the model with data that it has not seen before in order to see how well it can
generalize. We do not use the validation set as we have already used it to determine
when to stop training, so effectively our model is biased towards good validation
performance. The test set is a brand new set of data that should only be used at the end
of the model training procedure.

We take 80% of the original data for the training set, and use the remaining 20% for test. We
then split the resulting training set again, and use 90% for actual training and the other 10%
for validation.
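
The arithmetic of the two splits can be worked through with the sequence counts reported above. The sketch below is plain Python and does not call the splitting function used in the lab; it assumes that the training part of each split is rounded down and that the remainder goes to the other part, an assumption that reproduces the set sizes printed by this notebook.

```python
import math

def split_sizes(n_samples, train_fraction):
    """Sizes of the two parts when train_fraction of n_samples is kept for training.

    Assumption: the training part is rounded down, the rest goes to the other part.
    """
    n_train = math.floor(n_samples * train_fraction)
    return n_train, n_samples - n_train

total = 2 * 271316                                     # equal number of sequences per author
n_train_full, n_test = split_sizes(total, 0.8)         # 80% train, 20% test
n_train, n_validate = split_sizes(n_train_full, 0.9)   # 90% train, 10% validation

print(total)        # 542632
print(n_test)       # 108527
print(n_validate)   # 43411
print(n_train)      # 390694
```

In other words, the model is actually trained on 72% of the balanced data (0.8 x 0.9), with 8% held out for validation and 20% for the final test.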

Exercise: Make sure that the data is in the proper shape for use in an RNN. See here for a
hint.

In [20]:
# Take equal amounts of sequences from both authors
X = np.vstack((X_madison, X_hamilton[:X_madison.shape[0]]))
y = np.vstack((y_madison, y_hamilton[:y_madison.shape[0]]))

# Break data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.8)

# Data is to be fed into RNN - ensure that the actual data is of size [batch size, sequence length, 1]
X_train = X_train.reshape(-1,SEQ_LEN,1)   # exercise solution: reshape the data to fit an RNN
X_test = X_test.reshape(-1,SEQ_LEN,1)     # exercise solution: reshape the data to fit an RNN

# Break train set into train and validation
# (end of the next line is truncated in the source; the 90% train share comes from the text above)
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, train_size=0.9)

# Print the shapes of the train, validation and test sets
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_validate shape: {}".format(X_validate.shape))
print("y_validate shape: {}".format(y_validate.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (390694, 30, 1)

y_train shape: (390694, 1)
X_validate shape: (43411, 30, 1)
y_validate shape: (43411, 1)
X_test shape: (108527, 30, 1)
y_test shape: (108527, 1)

Finally, we construct the model graph and perform the training procedure. Notice how each
part of the model described above is implemented in TensorFlow code.

A single training epoch takes around 8.5 minutes on a K80 GPU. We have therefore provided
pretrained weights for the model at 1, 10 and 20 epochs. Simply run the code with no
changes to use the pretrained weights. If you'd like to perform the training yourself, change
the value of the RUN_TRAINING variable in the second cell below to True. You can also
control the number of training epochs using the NUM_EPOCHS variable.

Exercise: Complete the code for the tensorflow layers below.

See hint for the embedding layer

See hint for the LSTM layer

In [21]:
import tensorflow as tf

def build_model(sequences, embedding_size=100, lstm_size=256):
    """ Build the Tensorflow graph that implements the neural author attribution model

    # Arguments
        sequences: character sequence data of size (batch size, sequence length, 1)
        embedding_size: size of embedding vector to be generated for each character
        lstm_size: size of vector that will be output by the LSTM style model

    # Returns
        result: output of the entire model, a value between 0 and 1
        lstm_output: the last output of the LSTM for each input sequence
    """
    # Sequences need to be of type integer
    sequences = tf.cast(sequences, tf.int32)

    # Placeholder for determining train or test mode
    with tf.variable_scope("Options"):
        mode = tf.placeholder(tf.string, (), "mode")
        is_training = tf.equal("train", mode)

    # Layer for selecting character embeddings
    with tf.variable_scope("Embeddings"):
        # NOTE: the next two lines are cut off at the page edge in the source and are left as-is
        embeddings = tf.get_variable("embeddings", (len(tokenizer.word_inde…
                                     initializer=tf.initializers.random_unif…
        result = tf.nn.embedding_lookup(embeddings, sequences)   # exercise solution: embedding lookup

    # Style model implemented using LSTM
    with tf.variable_scope("StyleModel"):
        # We use an LSTM in lieu of a generic RNN
        lstm = tf.contrib.rnn.LSTMCell(lstm_size)   # exercise solution: LSTM cell

        result = tf.reshape(result, (-1, SEQ_LEN, embedding_size))
        result = tf.unstack(result, SEQ_LEN, 1)   # TF-specific way of feeding a sequence into an RNN
        result, _ = tf.contrib.rnn.static_rnn(lstm, result, dtype=tf.float32)
        lstm_value = result[-1]   # Only take the last result

    # Fully-connected classification model on top of the LSTM output
    # NOTE: the initializer arguments of w1, w2 and w3 are cut off in the source and are left as-is
    with tf.variable_scope("Hidden1"):
        w1 = tf.get_variable("w1", (lstm_size, 128), initializer=tf.initial…
        b1 = tf.get_variable("b1", 128)
        result = tf.nn.relu(tf.matmul(lstm_value, w1) + b1)

        # Dropout should only be applied during training
        result = tf.layers.dropout(result, rate=0.4, training=is_training)

    with tf.variable_scope("Hidden2"):
        w2 = tf.get_variable("w2", (128, 64), initializer=tf.initializers…
        b2 = tf.get_variable("b2", 64)
        result = tf.nn.relu(tf.matmul(result, w2) + b2)

        # Dropout should only be applied during training
        result = tf.layers.dropout(result, rate=0.3, training=is_training)

    with tf.variable_scope("Output"):
        w3 = tf.get_variable("w3", (64, 1), initializer=tf.initializers.…
        b3 = tf.get_variable("b3", 1)
        result = tf.nn.sigmoid(tf.matmul(result, w3) + b3)

    return result, lstm_value

def build_loss(logits, labels):
    """ Build the graph operations for calculating loss

    # Arguments
        logits: output value of the model (despite the name, this is the
                sigmoid output, i.e. a probability between 0 and 1)
        labels: real labels of the data

    # Returns
        loss: the loss value
        num_correct: number of instances that were correctly classified
    """
    loss = tf.losses.log_loss(labels, logits)

    preds = tf.round(logits)
    equality = tf.equal(tf.cast(labels, tf.float32), preds)
    num_correct = tf.reduce_sum(tf.cast(equality, tf.float32))

    return loss, num_correct

def build_training(loss, learning_rate=0.001):
    """ Build the graph operations for performing training

    # Arguments
        loss: loss value for the model
        learning_rate: the learning rate to use for the training procedure
    """
    # NOTE: the next line is cut off at the page edge in the source and is left as-is
    return tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum…

The following cell contains the code to perform the training loop on our data. Note the use of
Tensorflow iterators for importing different datasets without changes to the code.

In [22]:
import time

### Hyperparameters
BATCH_SIZE = 4096
NUM_EPOCHS = 20   # Change this to shorten training time at the expense of model performance

def make_dataset(X,y):
    """ Creates a dataset composed of (data, label) instances, to be used for training and evaluation

    # Arguments
        X: the data to be used for training
        y: the labels to be used for training

    # Returns
        ds: a Dataset object to be used for creating iterators
    """
    # NOTE: both continuation lines are truncated in the source; zipping with the labels
    # and batching by BATCH_SIZE are reconstructed from the surrounding code.
    ds = tf.data.Dataset.zip(
        (tf.data.Dataset.from_tensor_slices(X), tf.data.Dataset.from_tensor_slices(y))
    ).shuffle(len(X), reshuffle_each_iteration=True).batch(BATCH_SIZE)
    return ds

def evaluate(dataset):
    """ Perform evaluation of the model given a specific dataset

    # Arguments
        dataset: the dataset to be used for the evaluation

    # Returns
        mean of the loss value over all batches in this dataset
        accuracy score for the dataset
    """
    # Initialize an iterator from the input dataset
    sess.run(data_iterator.make_initializer(dataset))

    total_inputs = 0
    total_correct = 0
    losses = []
    try:
        # Iterate over all batches in the iterator - an exception will be thrown when no more
        # data is available.
        while True:
            # (list of fetched operations is truncated in the source; reconstructed from the unpacked names)
            logits_value, loss_value, value_correct = sess.run([logits, loss, num_correct],
                                                               feed_dict={"Options/mode:0":"test"})
            total_inputs += logits_value.shape[0]
            total_correct += value_correct
            losses.append(loss_value)
    except tf.errors.OutOfRangeError:
        # This exception is expected. Simply continue.
        pass

    return np.mean(losses), total_correct / total_inputs

def train():
    """ Perform a single training epoch of a model

    # Returns
        mean of the training loss value over all batches in this dataset
        duration: time elapsed for performing a single epoch
    """
    # Initialize an iterator from the training dataset
    sess.run(data_iterator.make_initializer(train_ds))

    losses = []
    duration = 0
    start = time.time()
    try:
        # Iterate over all batches in the iterator - an exception will be thrown when no more
        # data is available.
        while True:
            # (feed_dict is truncated in the source; "train" mode reconstructed from build_model)
            loss_value, _ = sess.run([loss, training_step], feed_dict={"Options/mode:0":"train"})
            losses.append(loss_value)
    except tf.errors.OutOfRangeError:
        # This exception is expected. Simply continue.
        duration = time.time() - start

    return np.mean(losses), duration

def validate():
    """ Evaluate a validation set on a model

    # Returns
        Results of evaluating a validation set on a model
    """
    return evaluate(validate_ds)

def test():
    """ Evaluate a test set on a model

    # Returns
        Results of evaluating a test set on a model
    """
    return evaluate(test_ds)

tf.reset_default_graph()
g = tf.Graph()
with g.as_default():
    with tf.name_scope("input"):
        # Build training, validation and test datasets of (data, label) instances
        train_ds = make_dataset(X_train, y_train)
        validate_ds = make_dataset(X_validate, y_validate)
        test_ds = make_dataset(X_test, y_test)

        # Build iterator of a specific shape (to be used for multiple datasets)
        # (arguments are truncated in the source; output_types/output_shapes are assumed)
        data_iterator = tf.data.Iterator.from_structure(train_ds.output_types, train_ds.output_shapes)

        # A single iterator will return batches of data and labels
        next_sequence, next_label = data_iterator.get_next()

    # Build a model along with its loss and training operators
    logits, lstm_output = build_model(next_sequence)
    loss, num_correct = build_loss(logits, next_label)
    training_step = build_training(loss)

    init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())

    saver = tf.train.Saver(max_to_keep=NUM_EPOCHS)

sess = tf.Session(graph=g)

############# Flag for either training or restoring model from file ###
RUN_TRAINING = False

if not RUN_TRAINING:
    # Pretrained weights are provided at 1, 10 and 20 epochs (see the text above)
    saver.restore(sess, "/dli/data/checkpoints/model.ckpt-20")
    print("Restored model from file!")
else:
    sess.run(init_op)

    # Run training and validation epochs
    for epoch in range(1, NUM_EPOCHS+1):
        mean_train_loss, duration = train()
        mean_val_loss, accuracy = validate()
        # NOTE: the next two statements are truncated in the source. The global_step argument
        # and the tail of the message are assumptions, not the original text.
        save_path = saver.save(sess, "/dli/data/checkpoints/model.ckpt", global_step=epoch)
        print("Epoch {0} ({1:.3f} seconds), training loss {2:.5f}, validation loss {3:.5f}, validation accuracy {4:.3f}"
              .format(epoch, duration, mean_train_loss, mean_val_loss, accuracy))

mean_test_loss, accuracy = test()
print("Test loss: {0:.5f}, test accuracy: {1:.3f}".format(mean_test_loss, accuracy))

INFO:tensorflow:Restoring parameters from /dli/data/checkpoints/model.ckpt-20
Restored model from file!

Once the model has finished training, compare the validation and test losses to the training
loss. Notice that the test and validation loss values are similar, but not identical. This indicates
that the validation set is a good approximation for the performance of the test set and the
model's ability to generalize to new, unseen data. Notice also that the model's training loss is
lower than both validation and test loss. This may indicate that the model is beginning to
overfit by modelling the 'noise' in the training data. A good rule-of-thumb is to stop the
training when the validation loss begins to rise while the training loss continues to drop.
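
This rule of thumb can be sketched as a small helper that scans the per-epoch validation losses and reports which epoch's weights to keep. It is an illustration only and is not part of the lab's training loop; the function name, the patience parameter and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch whose weights should be kept.

    Training is considered finished once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break                       # no improvement for `patience` epochs
    return best_epoch

# Hypothetical validation losses, one per epoch
val_losses = [0.62, 0.55, 0.51, 0.50, 0.52, 0.54, 0.49]
print(early_stopping_epoch(val_losses))               # 4
print(early_stopping_epoch(val_losses, patience=3))   # 7
```

A larger patience tolerates temporary increases in the validation loss at the cost of more training time. Since the lab saves a checkpoint after every epoch, the weights of the chosen epoch could then be restored from file.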

Evaluating the Model

Once we have a trained model, we can proceed to evaluate the model's performance. We will
perform this evaluation in two ways:

1. Check the output of the style feature encoder - does it coincide with our intuition of what
it should be doing?
2. Compare the output of the classifier with the published research on the disputed papers -
how accurate is our model?

Observing the Style Features Encoder

As described above, the output of the feature encoder is a fixed-size vector. The dimension of
this vector is determined by the lstm_size parameter in the build_model function above. It is
interesting to see what insights we can get from looking at this vector and how it applies to
the overall model's performance.

In the next cell, we create a new dataset that uses the test set and shuffles it. Recall that the
test set contains data that has not been used during the model's training in any way. We
shuffle it to make sure that we get random instances and in particular, that we do not get
sequences that follow each other in the actual text. We then run a single batch of data
through the trained model. However, rather than look at the final output, we take the output
from the LSTM layer (which itself follows a character embedding layer). We also make sure to
include the real labels from this data. Run this cell multiple times to extract subsequent
batches for viewing.

In [23]: with g.as_default():

    embeddings_ds = test_ds.shuffle(len(X_test))
    sess.run(data_iterator.make_initializer(embeddings_ds))
    lstm_vectors, real_labels = sess.run([lstm_output, next_label])

We would now like to get a visual idea of the style features vector. In order to do this, we need
to apply a Dimensionality Reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction)
technique to transform the high-dimensional vector into a 2 or 3-dimensional vector which we
can then visualize. We do this with a technique called Principal Component Analysis (PCA)
(https://en.wikipedia.org/wiki/Principal_component_analysis) and then plot the result.
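
To make the idea concrete, here is a minimal NumPy sketch of what a PCA projection does: centre the data, find the directions of greatest variance, and keep the coordinates along the first few of them. It is an illustration under stated assumptions, not the lab's implementation: the helper `pca_project` is hypothetical, and the random input merely stands in for a batch of 256-dimensional style vectors (256 being the default `lstm_size`).

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)  # PCA works on mean-centred data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Fraction of the total variance captured by each component.
    variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return projected, variance_ratio[:n_components]


# Stand-in for a batch of style vectors: 200 random 256-dimensional rows.
rng = np.random.RandomState(12345)
fake_vectors = rng.normal(size=(200, 256))
points_2d, ratios = pca_project(fake_vectors)
print(points_2d.shape, ratios.sum())
```

Each row of `points_2d` is one point that could be placed on a scatter plot. The sign of each axis is arbitrary, so a plot produced this way may appear mirrored relative to one produced by a library implementation.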


In [24]: %matplotlib inline

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from matplotlib import ticker

pca = PCA(n_components=2, random_state=12345)

transformed_values = pca.fit_transform(lstm_vectors)

plt.figure(figsize=(20,7))
plt.scatter(transformed_values[:,0], transformed_values[:,1], c=real_labels
colorbar = plt.colorbar();

def colorbar_labeler(value, _):
    if value == float(AH):
        return "AH"
    elif value == float(JM):
        return "JM"
    return ""

colorbar.formatter = ticker.FuncFormatter(colorbar_labeler)
colorbar.update_ticks()


There are two interesting observations we can make from this plot:

1. The sequences in the test set seem to cluster together for each author, i.e. they are not
randomly placed in the figure. (If you have time, try reloading the model weights after a
single epoch and then re-running the plot.) This coincides with our hypothesis that each
author has a specific style and that the model can learn to identify it.
2. The two clusters have a certain amount of overlap. This is not surprising, as the authors
use many similar words and phrases.

Note, however, that the red cluster (AH) has a number of points (i.e., sequences) that seem to
lie deep in the yellow cluster (JM). Two possible interpretations for this phenomenon are as
follows:

1. Either AH occasionally uses vocabulary and style that closely resembles that of JM, or
2. The labels are not correct. Keep in mind that the data are just sequences from the text
and not the entire text itself, so labels refer to whether AH or JM wrote a particular
sequence. Hence, this may point to the fact that some papers are actually collaborations
between AH and JM. What do you think?

Try running this extraction and visualization code multiple times.

Optional: Using TSNE for Visualization

When applying PCA, we chose to use only the 2 most significant principal components which
only account for a small amount of variance in the data. Since we cannot plot more than 3
components, we need to use a different method if we'd like a more detailed look. Here we
apply a method called T-distributed Stochastic Neighbor Embedding (T-SNE)
(https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).

Because T-SNE is very computationally intensive (and unfortunately, we do not have a GPU
implementation) we take a different approach: we apply PCA to reduce dimensionality of the
original vector, and apply T-SNE on the result.

In the following cell, change the NUM_DIMENSIONS parameter to control the amount of
variance plotted, and the NUM_ITERATIONS parameter to change the computation time of
the algorithm. Note that this may take several minutes to run.


In [25]: from sklearn.manifold import TSNE

NUM_DIMENSIONS = 30
NUM_ITERATIONS = 500 # No less than 250

pca = PCA(n_components=NUM_DIMENSIONS, random_state=12345)

pca_values = pca.fit_transform(lstm_vectors)
print("Explained variance: {}%".format(sum(pca.explained_variance_ratio_

tsne = TSNE(n_iter=NUM_ITERATIONS)
transformed_values = tsne.fit_transform(pca_values)

plt.figure(figsize=(20,7))
plt.scatter(transformed_values[:,0], transformed_values[:,1], c=real_labels
colorbar = plt.colorbar();

colorbar.formatter = ticker.FuncFormatter(colorbar_labeler)
colorbar.update_ticks()

Explained variance: 38.430435955524445%
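
The explained-variance figure is the share of the data's total variance retained by the components that were kept. As a rough guide to choosing NUM_DIMENSIONS, the sketch below finds the smallest number of components needed to reach a target share. It is an illustration only: `components_for_variance` is a hypothetical helper, and the synthetic data is constructed so that three directions dominate, which is not a claim about the lab's LSTM vectors.

```python
import numpy as np


def components_for_variance(vectors, target=0.9):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `target` (a fraction between 0 and 1)."""
    centered = vectors - vectors.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    # Index of the first running total that reaches the target, plus one
    # to turn the index into a count of components.
    count = int(np.searchsorted(cumulative, target)) + 1
    # Clamp in case round-off leaves the final total just below the target.
    return min(count, len(cumulative)), cumulative


rng = np.random.RandomState(0)
# 500 synthetic vectors in 50 dimensions; the first 3 axes carry most variance.
scales = np.concatenate([np.full(3, 10.0), np.full(47, 1.0)])
fake_vectors = rng.normal(size=(500, 50)) * scales
needed, cumulative = components_for_variance(fake_vectors, target=0.8)
print("Components needed for 80% of the variance:", needed)
```

Plotting `cumulative` against the component count gives a quick picture of how much each additional dimension contributes before the more expensive T-SNE step is run.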

Applying the Model to the Disputed Papers

We will now proceed to our original goal: run the model on the disputed papers and observe
the results. To do that, we simply repeat the procedure we performed for the training data: we
read the original text, preprocess it to remove capitalization and standardize whitespace, and
convert it to sequences. We then run all sequences through the model and tally the results - the
author with more votes is the 'winner'.
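
The voting step described above can be sketched in isolation. This is a minimal illustration, not the lab's code: the function `tally_votes` is hypothetical, the label values assigned to `AH` and `JM` here are assumptions (the lab defines its own constants earlier in the notebook), and the predictions are made up.

```python
import numpy as np

# Hypothetical label encoding, for illustration only; the lab defines its own
# AH and JM constants earlier in the notebook.
AH, JM = 0, 1


def tally_votes(batches_of_predictions):
    """Count per-author votes over batches of rounded 0/1 predictions."""
    votes = {AH: 0, JM: 0}
    for predictions in batches_of_predictions:
        predictions = np.asarray(predictions).ravel()
        # Counting each label explicitly stays correct even when a batch
        # happens to contain only one of the two labels.
        votes[AH] += int(np.count_nonzero(predictions == AH))
        votes[JM] += int(np.count_nonzero(predictions == JM))
    return votes


# Made-up predictions for three batches of sequences from one paper.
example_batches = [
    np.array([1.0, 1.0, 0.0, 1.0]),
    np.array([1.0, 1.0, 1.0, 1.0]),  # a batch with a single label
    np.array([0.0, 1.0, 0.0, 0.0]),
]
votes = tally_votes(example_batches)
winner = "Alexander Hamilton" if votes[AH] > votes[JM] else "James Madison"
print("{}: {} to {}".format(winner, max(votes.values()), min(votes.values())))
```

Because every sequence casts one vote, a narrow margin between the two totals is worth treating as a less certain attribution than a wide one.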

In [26]: for x in sorted(os.listdir('./federalist_papers/unknown/')):

    disputed_text = preprocess_text('./federalist_papers/unknown/' + x)
    disputed_long_sequence = tokenizer.texts_to_sequences([disputed_text
    X_sequences, _ = make_subsequences(disputed_long_sequence, UNKNOWN)
    X_sequences = X_sequences.reshape((-1,SEQ_LEN,1))

    with g.as_default():

        votes_for_madison = 0
        votes_for_hamilton = 0

        dataset = make_dataset(X_sequences, np.zeros((X_sequences.shape[

        sess.run(data_iterator.make_initializer(dataset))

        try:
            while True:
                predictions = sess.run(tf.round(logits), feed_dict={"Optio
                counts = np.unique(predictions, return_counts=True)[1]
                votes_for_hamilton += counts[AH]
                votes_for_madison += counts[JM]

        except tf.errors.OutOfRangeError:
            pass

    print("Paper {} is predicted to have been written by {}, {} to {}"
        x.replace('paper_','').replace('.txt',''),
        ("Alexander Hamilton" if votes_for_hamilton > votes_for_mad
        max(votes_for_hamilton, votes_for_madison), min(votes_for_h

Paper 49 is predicted to have been written by James Madison, 5506 to 4573
Paper 50 is predicted to have been written by James Madison, 3923 to 2982
Paper 51 is predicted to have been written by James Madison, 6640 to 5088
Paper 52 is predicted to have been written by James Madison, 5834 to 5268
Paper 53 is predicted to have been written by James Madison, 6675 to 6437
Paper 54 is predicted to have been written by James Madison, 6509 to 5373
Paper 55 is predicted to have been written by James Madison, 6840 to 5286
Paper 56 is predicted to have been written by James Madison, 4942 to 4675
Paper 57 is predicted to have been written by James Madison, 6662 to 6489
Paper 58 is predicted to have been written by James Madison, 6655 to 6070
Paper 62 is predicted to have been written by James Madison, 7885 to 6300
Paper 63 is predicted to have been written by James Madison, 10149 to 8154

Compare the results from our model to those listed here
(https://en.wikipedia.org/wiki/The_Federalist_Papers#Complete_list). How did we do?

In [31]: # Close Tensorflow session and free GPU resources

sess.close()

Summary
In this lab, we discussed the problem of authorship attribution. We presented the Federalist
Papers debate, and built a Deep Learning model to address it. Finally, we looked at the model
internals to get an intuition for how it encodes stylometric properties.

For the Federalist Papers, we know for a fact that each debated paper was written by either
Alexander Hamilton or James Madison. Therefore we used a model that will output one or the
other. More generally, we could have used a model that returns confidence levels for each
author - we would then have multiple sigmoid outputs, each giving us the probability of the
input text having been written by an author. If no sigmoid exceeds a certain threshold (for
example, 0.5), we can declare that text's author as unknown.
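
A minimal sketch of that thresholding idea follows. It is an illustration only: the function `attribute_author`, the three-author list and the logit values are hypothetical and are not produced by the model built in this lab.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def attribute_author(logits, authors, threshold=0.5):
    """Return the most probable author, or "unknown" if no independent
    sigmoid output exceeds `threshold`."""
    probabilities = sigmoid(np.asarray(logits, dtype=float))
    best = int(np.argmax(probabilities))
    if probabilities[best] <= threshold:
        return "unknown", probabilities
    return authors[best], probabilities


authors = ["Alexander Hamilton", "James Madison", "John Jay"]
# Made-up logits: one confident output, then none above the threshold.
confident, _ = attribute_author([-1.2, 2.3, -0.4], authors)
uncertain, _ = attribute_author([-2.0, -0.7, -1.5], authors)
print(confident)
print(uncertain)
```

Unlike a softmax, the sigmoid outputs are independent and need not sum to 1, which is what allows every author to be rejected at once.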

Authorship attribution is one type of text classification problem, and such problems are
pervasive. The models and approach you saw here are relevant to many types of text classification.

Assessment: Categorize words

Modify Sense2Vec.py (../../../../edit/tasks/task2/task/Sense2Vec.py) by filling in all the
##TODO## in the file to categorize all the words into verbs and nouns.

Test your solution with the following command:


In [33]: !python3 Sense2Vec.py

Epoch 105/170
- 0s - loss: 0.2509 - acc: 0.8542 - val_loss: 0.1478 - val_acc: 0.9375
Epoch 106/170
- 0s - loss: 0.2402 - acc: 0.8750 - val_loss: 0.1555 - val_acc: 0.9375
Epoch 107/170
- 0s - loss: 0.2415 - acc: 0.8681 - val_loss: 0.2248 - val_acc: 0.8750
Epoch 108/170
- 0s - loss: 0.2868 - acc: 0.8611 - val_loss: 1.2011 - val_acc: 0.7500
Epoch 109/170
- 0s - loss: 0.5090 - acc: 0.8264 - val_loss: 0.1561 - val_acc: 0.8750
Epoch 110/170
- 0s - loss: 0.2371 - acc: 0.8889 - val_loss: 0.2822 - val_acc: 0.8750
Epoch 111/170
- 0s - loss: 0.2663 - acc: 0.8681 - val_loss: 0.1813 - val_acc: 0.8750

After completing this, go back in the browser page you used to open this notebook and
click the ASSESS TASK button. If you have categorized your words correctly then you
will pass this assessment.

In [*]: os.getcwd()

References
[1] The complete text of the Federalist Papers is available from Project Gutenberg here
(http://www.gutenberg.org/ebooks/18). The data has been split into multiple files for your
convenience, and the papers by John Jay have been removed.

