


Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 515–522. ACSIS, Vol. 15. ISSN 2300-5963. DOI: 10.15439/2018F227.

A Comparative Study of Classifying Legal Documents with Neural Networks
Samir Undavia, Adam Meyers, John E. Ortega
New York University
60 5th Avenue
New York, New York 10011, USA
Email: {su478,meyers,jortega}@cs.nyu.edu

Abstract—In recent years, deep learning has shown promising results when used in the field of natural language processing (NLP). Neural networks (NNs) such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used for various NLP tasks including sentiment analysis, information retrieval, and document classification. In this paper, we present the Supreme Court Classifier (SCC), a system that applies these methods to the problem of document classification of legal court opinions. We compare methods using traditional machine learning with recent NN-based methods. We also present a CNN used with pre-trained word vectors which shows improvements over the state of the art applied to our dataset. We train and evaluate our system using the Washington University School of Law Supreme Court Database (SCDB). Our best system (word2vec + CNN) achieves 72.4% accuracy when classifying the court decisions into 15 broad SCDB categories and 31.9% accuracy when classifying among 279 finer-grained SCDB categories.

I. INTRODUCTION

Legal court opinions are lawful statements written by a judge providing the justification and legal reasoning for a court ruling. This paper describes an automated document classification model implemented as our Supreme Court Classifier (SCC) system. In theory, SCC could make obsolete many time-consuming manual tasks requiring legal experts. A legal expert would need to read hundreds or thousands of documents in order to place opinions into subject categories, whereas an automatic system like SCC could do this with little or no human effort.

Footnote 1: SCC can be downloaded from https://github.com/samir1/web_of_law_scotus_classification/ under an Apache 2.0 license (https://www.apache.org/licenses/LICENSE-2.0). In addition to computer code, the repository includes our training/development/test split of the SCDB data, ensuring that our results are both reproducible and comparable to future work.

Some document classification efforts, particularly those using unsupervised approaches, evaluate output based on human evaluation of automatically derived categories. However, when automatic document classification is based on human-defined categories, the results are, arguably, more "natural." Evaluation tends to be more straightforward with human-annotated classifications because it is usually easy for a human being to tell whether or not a document belongs to a human-defined category. In contrast, this determination is harder to make with purely unsupervised methods (e.g., topic modeling [1]), unless a manual component is added. For example, aligning automatically defined categories with some set of human categories will produce clearer results. Human-defined categories have names and notional definitions such as Criminal Procedure, Civil Rights, and Federal Taxation [2]. In contrast, automatically classified categories are usually defined as sets of keywords or other more oblique definitions using words in the corpus. For instance, the case Roe v. Wade, 410 U.S. 113 (1973), may fit into a class labeled by a set of keywords like abortion, reproduction, medical, and so on. Although these words describe the case, they do not correctly encapsulate the legal significance of the case. Roe v. Wade would be classified under the Privacy legal issue and the abortion: including contraceptives sub-issue [2]. Unfortunately, it may be difficult for a human to decide how the boundaries of the classes are defined by these keyword sets. On the other hand, it may be possible to align the output of unsupervised classification with a manual set of categories. In fact, we do this for our baseline system, which uses latent Dirichlet allocation (LDA) to model documents and then a logistic regression (LR) [3] model to align the results with the manual categories (see Section IV-A).

Footnote 2: Roe v. Wade is a Supreme Court case that is famous in the United States for causing abortion to become legal.

While categories can be extracted in many ways, we believe that some sort of human validation is preferable. Moreover, the legal domain often requires documents to be aligned with human-defined legal categories for a majority of legal tasks. For this reason, we have implemented a system that uses pre-defined document categories (from human evaluators) to train a model for classifying legal texts. Specifically, SCC automatically classifies legal opinions from cases seen by the Supreme Court of the United States (SCOTUS), manually classified into topics as part of the Supreme Court Database (SCDB) by Washington University School of Law [2]. Henceforth, we will refer to these as the SCDB categories. The SCDB categories are defined in Section III-A.




Our work systematically tests the application of a variety of machine learning (ML) techniques to automate semantic classification of SCOTUS legal opinions. Our most successful systems are based on neural networks (NNs). NNs are used to solve a variety of natural language processing (NLP) tasks because of their ability to extract relevant information from text without having to specify features for any particular domain [4]. We examine two common NN architectures: the convolutional neural network (CNN) [5] and the recurrent neural network (RNN) [6], much like the medical text classification experiments using CNNs conducted by Hughes et al. [7]. Initially used for classifying images, variations of the original CNN architecture are now used for NLP tasks [8]. Two main variations of the RNN, long short-term memory (LSTM) [9] and gated recurrent unit (GRU) [10], have recently shown successful results when applied to sequence modeling [11], [12].

We compare a series of different CNN and RNN architectures applied to documents represented by word embeddings. We show that our CNN performs better on the legal corpus than the other models implemented. SCC uses neural networks to select one of the SCDB categories for each Supreme Court case in our validation corpus.

In this paper, we use CNNs and RNNs to classify legal documents with minimal pre-processing, in contrast to other machine learning approaches (e.g., support vector machines) that require manually specifying features for classification or manually determining keywords [13]. Any additional pre-processing that could potentially improve performance requires manual editing, since each document contains slightly different formatting resulting from OCR errors from scans of printed documents. We measure the efficacy of NN classification techniques applied to our corpus and show that our CNN outperforms RNNs for legal text classification based on SCDB categories (see Table I in Section V). In order to apply our classification models to text, we first represent each word in our corpus as a word embedding (vector representations of words were generated using an unsupervised learning method from Mikolov et al. [14]). We use the publicly available pre-trained word vectors trained on about 100 billion words from part of the Google News dataset (https://code.google.com/archive/p/word2vec/). We use 300-dimensional word2vec vectors trained using the skip-gram architecture with negative sampling by Mikolov et al. [14]. We map each word to a word vector and use neural network classifiers on the dataset. Additionally, we present results from using two other word embedding models, fastText [15] and GloVe [16], with our CNN (our best system). We describe the neural network models in Section IV and the results in Table I.

II. RELATED WORK

We have found a relatively small body of previous work on automatic text classification of legal documents. For example, support vector machines (SVMs) have been used to classify legal documents like legal docket entries [17] and to classify non-English legal texts [4]. Although our work also examines the application of machine learning to a corpus written in the legal context, we focus on classifying SCOTUS legal opinions without significant pre-processing. For example, the Nallapati and Manning [17] system includes several steps of pre-processing before using an SVM to classify documents with human-selected features and labels. We explore more recent automated document classification techniques that do not rely on significant pre-processing and human interaction. Moreover, we present a comparison of different machine learning techniques in order to determine the methods with the highest performance for our task.

Our work on SCC is similar to the approach of Wood et al. [1] for classifying medical summaries in that we model our corpus using LDA and classify with pre-defined labels (see Section IV-A). In that study, an initial topic model was derived from some training documents. Then the topic model was modified with pre-labeled data in order to classify the new data. Likewise, we use a combination of LDA and pre-labeled legal opinions to create our baseline classification system. We compare the results of our NN-based classification to our application of LDA and an LR classifier.

Domain-specific automated document classification has been applied to several fields, including electronic medical records. Hughes et al. [7] use convolutional neural networks to detect features for sentence-level classification of medical texts, resulting in a much smaller input compared to our experiments. Unlike SCC, in which we feed an entire document into a neural network, the Hughes et al. [7] model classifies texts by first transposing each document into a matrix of sentences with fixed lengths. Their model also differs from ours in that their model is essentially two sets of two stacked convolutional layers followed by a pooling layer, whereas our model does not have any consecutive convolutional layers. Additionally, one of our experiments (following Hughes et al. [7]) tests the effectiveness of using doc2vec embeddings with an LR model as the classifier. Similarly, Weng et al. [18] use medical texts as the subject of their classification task, although they use a different neural network architecture to classify health documents. Weng et al. [18] apply a complex combination of CNN and RNN architectures to clinical note text classification; their model is summarized by three convolutional and pooling layers followed by a bidirectional LSTM [18]. Moreover, Yin et al. [19] present a comparison of different neural network architectures used to complete a variety of NLP tasks. Such tasks include sentiment analysis, document classification, and part-of-speech tagging. In their text classification experiment, Yin et al. [19] use a pre-labeled set of 10,717 sentences evenly distributed over 19 labels, compared to our unevenly distributed dataset, in which a third of the categories have under one hundred examples and four classes have over 1,000 documents. In contrast, our experiments aim to solve the specific problem of document classification applied to legal texts. Moreover, Kim [20] describes a general CNN used to classify sentences with word2vec word embeddings. Similar to the model we propose, Kim's model also includes three convolutional and pooling layers. We optimize hyperparameters and experiment with a combination of different convolutional, pooling, and dropout layers. We compare applications of the Kim [20] and Hughes et al. [7] models to our legal corpus. We ultimately obtain improved results by using the customized CNN on the SCOTUS legal opinion corpus.

III. EXPERIMENTAL SETUP

A. Dataset

We train and test our system on the manually categorized SCOTUS legal opinion (Supreme Court Database, or SCDB) corpus from Washington University School of Law [2]. The dataset consists of 8419 US Supreme Court opinions from "modern" cases (1946-2016), organized into 15 legal categories (Figure 1), which are further divided into 279 subtopics (Figure 2). We chose the modern dataset because both court opinions and pre-defined labels were available through the textacy Python package (http://textacy.readthedocs.io/en/latest/_modules/textacy/datasets/supreme_court.html). Textacy provides a formatted version of modern cases from FindLaw's US Supreme Court legal opinions database. The 8419 documents were randomly divided into training, validation, and test sets with an 80%/10%/10% split. Although the SCDB labels also cover "legacy" cases (1791-1945), FindLaw's database only reliably provided case text from the "modern" era of US law.

Fig. 1: 15 legal issues
Fig. 2: 279 subtopics
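For concreteness, the 80%/10%/10% division can be made along the following lines; this is a minimal sketch, and load_scdb_opinions is a hypothetical helper standing in for however the opinion texts and their SCDB labels are actually read (e.g., via textacy's Supreme Court dataset), not an API of any library.

    import random

    def split_80_10_10(pairs, seed=0):
        """Randomly divide (text, label) pairs into training, validation, and test sets."""
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        n_train, n_val = int(0.8 * len(pairs)), int(0.1 * len(pairs))
        return pairs[:n_train], pairs[n_train:n_train + n_val], pairs[n_train + n_val:]

    # train, dev, test = split_80_10_10(load_scdb_opinions())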
B. Initial Processing

Our system first removes (as noise) special characters that refer to footnotes. We also remove a number of characters used in formatting the original printed document. Next, each word in the corpus is mapped to a word2vec vector before being fed into a neural network for classification.
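A minimal sketch of this mapping step with Gensim, assuming the Google News word2vec binary has been downloaded locally (the file path and variable names such as train_tokens are assumptions); it builds a word index and an embedding matrix that a downstream network can use, leaving out-of-vocabulary words as zero rows, which is one of several possible choices.

    import numpy as np
    from gensim.models import KeyedVectors

    # Pre-trained 300-dimensional Google News vectors (local path is an assumption).
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    # Vocabulary over the tokenized training opinions; index 0 is reserved for padding.
    vocab = {w: i + 1 for i, w in enumerate(sorted({t for doc in train_tokens for t in doc}))}
    embedding_matrix = np.zeros((len(vocab) + 1, 300))
    for word, idx in vocab.items():
        if word in w2v:
            embedding_matrix[idx] = w2v[word]

    def doc_to_ids(tokens):
        """Map a tokenized opinion to a sequence of vocabulary indices."""
        return [vocab[t] for t in tokens if t in vocab]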
IV. METHODS

A. LDA + Logistic Regression

Before the widespread use of neural networks for NLP tasks, probabilistic methods like LDA [21], [22] were used as a standard for a variety of NLP tasks, including text classification. We use LDA as a baseline against which to compare the results of the NN-based classification models. Our process involves using LDA to represent each of the heavily pre-processed legal documents as a series of latent feature vectors. LDA is most commonly used to generate a collection of latent topics for a corpus and then calculate the probability of a document belonging to a topic. We use the Gensim library (https://radimrehurek.com/gensim/) to create and train the LDA model. We classify the vectors produced by this model using LR.

The LDA Bayesian probabilistic model is an unsupervised machine learning method used to organize documents through topic modeling. In this model, each document is represented as a probability distribution over latent topics. These topics are derived from the assumption that the document's words themselves, modeled as a term frequency-inverse document frequency (tf-idf) matrix with words represented using the bag-of-words (BoW) model, are distributed over latent topics as defined by the distribution of words in the corpus.

Footnote 6: In order to find the ideal number of topics for the LDA-based classification, we conducted the experiment with different numbers of topics ranging from 100 to 600 in steps of 100 and chose 300 topics, because there was no noticeable improvement in performance with more than 300 topics.

After latent feature vectors are generated to describe each of the documents, we apply an LR model to classify the documents into 15 legal issues and 279 legal subtopics. As in Wood et al. [1], we use a combination of LDA and pre-defined labels with corresponding documents to create our baseline classification system.
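A condensed sketch of this baseline with Gensim and scikit-learn, assuming train_tokens and train_labels hold the tokenized, pre-processed opinions and their SCDB labels; the 300-topic setting follows Footnote 6, and other parameters are illustrative.

    import numpy as np
    from gensim import corpora, models
    from sklearn.linear_model import LogisticRegression

    dictionary = corpora.Dictionary(train_tokens)
    bows = [dictionary.doc2bow(toks) for toks in train_tokens]
    tfidf = models.TfidfModel(bows)
    lda = models.LdaModel(tfidf[bows], id2word=dictionary, num_topics=300)

    def topic_vector(toks):
        """Dense 300-dimensional topic-probability vector for one document."""
        vec = np.zeros(300)
        for topic_id, prob in lda[tfidf[dictionary.doc2bow(toks)]]:
            vec[topic_id] = prob
        return vec

    X_train = np.array([topic_vector(toks) for toks in train_tokens])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)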
B. Doc2vec + Logistic Regression

Our first method of document classification using deep learning involves a higher-level application of word2vec. As described in [23], paragraph vectors, often referred to as doc2vec or document vectors, can be used to map semantic meaning from a variable-length document to a fixed-size vector. As in the word2vec learning method, a word is predicted by its neighboring words. The significant difference between the Distributed Memory Model of Paragraph Vectors and other similar learning techniques is that an additional paragraph token (treated like an additional word in the document) is used when learning the paragraph vector. Next, we classify the documents into both 15 and 279 classes using a logistic regression.
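A minimal sketch of this method with Gensim's Doc2Vec in distributed-memory mode and a scikit-learn logistic regression; the tokenized documents and labels are assumed, and settings such as the epoch count are illustrative rather than the published configuration.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression

    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(train_tokens)]
    d2v = Doc2Vec(tagged, vector_size=300, dm=1, epochs=20, min_count=2)  # dm=1: distributed memory

    X_train = [d2v.dv[i] for i in range(len(train_tokens))]      # vectors learned during training
    X_dev = [d2v.infer_vector(toks) for toks in dev_tokens]      # vectors inferred for unseen documents

    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    print(clf.score(X_dev, dev_labels))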

C. Bag-of-Words + Support Vector Machine

As another baseline, we represent documents using the BoW model and apply an SVM using Scikit-learn's SVM package (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) with default parameters. We chose the SVM as a baseline because it is one of the highest performing traditional ML methods and is used for a wide range of tasks. Similarly, Kim [20] also uses an SVM benchmark.

D. Word2vec + CNN

For our neural network classification approach, we designed a multi-layer model similar to the one described by Kim [20], but with additional layers and modified hyperparameters. Our model first creates an embedding layer using pre-trained word2vec word embeddings, and then creates a matrix of documents represented by 300-dimensional word embeddings. We include three sets of the following: a dropout of 0.25, a convolution layer of 128 filters with a filter size of 5, and a max pooling layer with a pooling size of 5. We also add a dense layer consisting of 128 units between two dropouts of 0.5 to prevent overfitting. The last layer is a dense layer with size equal to the number of labels for the test (15 or 279).

E. Other Embeddings

In addition to using word2vec embeddings with the CNN, we conduct the same experiments with two other pre-trained word embeddings: fastText vectors (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) from Facebook AI Research (FAIR) [15] and GloVe vectors (https://nlp.stanford.edu/projects/glove/) from Pennington et al. [16]. The pre-trained 300-dimensional fastText vectors are trained on Wikipedia using the skip-gram model described in Bojanowski et al. [15]. The pre-trained 300-dimensional GloVe vectors are trained on Wikipedia and the Gigaword 5 dataset using the GloVe model [16]. As noted in Section I, our word2vec setting uses the publicly available pre-trained vectors trained on about 100 billion words from part of the Google News dataset (https://code.google.com/archive/p/word2vec/): 300-dimensional word2vec vectors trained using the skip-gram architecture with negative sampling by Mikolov et al. [14].

F. Word2vec + LSTM

One of the RNN-based networks we use to classify our legal corpus is the LSTM, which is defined by the following equations:

i_t = \sigma(x_t U^i + h_{t-1} W^i + b_i)    (1)
f_t = \sigma(x_t U^f + h_{t-1} W^f + b_f)    (2)
o_t = \sigma(x_t U^o + h_{t-1} W^o + b_o)    (3)
c_t = f_t \circ c_{t-1} + i_t \circ \tanh(x_t U^c + h_{t-1} W^c + b_c)    (4)
h_t = o_t \circ \tanh(c_t)    (5)

In this model, x_t represents an input x at time t. The three gates of the LSTM are the input gate i_t, the forget gate f_t, and the output gate o_t; c_t is the memory cell state and h_t is the hidden state. Consistent with the equations above, U denotes the input weights, W the recurrent weights, and b the biases.

Our implementation of the LSTM consists of the embedding layer formed with pre-trained word2vec vectors, an LSTM layer consisting of 128 units, and a dropout of 0.5 to prevent overfitting. Lastly, we have a dense layer representing the number of labels for the experiment.

G. Word2vec + GRU

Our last experiment involves the memory-enhanced GRU [10], a variation of the RNN. The GRU is described by the following equations:

z = \sigma(x_t U^z + h_{t-1} W^z)    (6)
r = \sigma(x_t U^r + h_{t-1} W^r)    (7)
s_t = \tanh(x_t U^s + (h_{t-1} \circ r) W^s)    (8)
h_t = (1 - z) \circ s_t + z \circ h_{t-1}    (9)

Here x_t represents an input vector x at time t, r is the reset gate, and z is the update gate. As above, U denotes the input weights and W the recurrent weights.

As with our CNN and LSTM models, our application of the GRU begins with a word2vec embedding layer. Next, we include a GRU layer of size 128 and a dropout of 0.5 before the final dense layer for classification.

H. Hyperparameters and Regularization

In our experiments, we tested our model with a series of different hyperparameters and found that our best NN systems use 128 units for the RNNs and 128 filters for each of the convolutional layers in the CNN. For both of these settings, we tried values of 32, 64, 128, and 256, and 128 gave the best results. The 128 setting gave better results than the lower settings, and it turned out that the 256 setting could not be run effectively when training on an Nvidia GPU; it seems that additional GPU memory (or a more efficient algorithm) would be required to use 256 units. It is probable that 128 is simply the largest power-of-2 setting that is practical to use given the available equipment. This seems to be supported by the fact that many other NN systems (e.g., Kim [20]) use a value around 100.

Additionally, each of the models is regularized with dropout [24], which works by "dropping out" a proportion p of hidden units during training. We found that a dropout of 0.5 before the final dense layer and a batch size of 32 worked best for the LSTM, GRU, and CNN. We also found that the Adam optimizer [25] worked best for both the CNN and RNN networks.
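To make the architectures above concrete, the following is a minimal Keras sketch of the word2vec + CNN model and the 128-unit recurrent variants, using the settings reported in this section (three convolution/pooling blocks, dropout of 0.5, Adam, batch size 32). Here vocab_size, max_len, num_labels, and embedding_matrix (built from the pre-trained word2vec vectors) are assumed to be prepared elsewhere, and choices such as activations and padding are our own assumptions rather than the exact published configuration.

    from tensorflow.keras import layers, models, initializers

    def build_scc_model(kind="cnn"):
        """Sketch of the word2vec + CNN / LSTM / GRU classifiers described above."""
        model = models.Sequential()
        model.add(layers.Embedding(
            vocab_size, 300,
            embeddings_initializer=initializers.Constant(embedding_matrix),
            input_length=max_len))
        if kind == "cnn":
            # Three blocks: dropout of 0.25, 128 filters of width 5, max pooling of 5.
            for _ in range(3):
                model.add(layers.Dropout(0.25))
                model.add(layers.Conv1D(128, 5, activation="relu"))
                model.add(layers.MaxPooling1D(5))
            model.add(layers.Flatten())
            # Dense layer of 128 units between two dropouts of 0.5.
            model.add(layers.Dropout(0.5))
            model.add(layers.Dense(128, activation="relu"))
        else:
            # Recurrent variants: a single 128-unit LSTM or GRU layer.
            model.add(layers.LSTM(128) if kind == "lstm" else layers.GRU(128))
        model.add(layers.Dropout(0.5))
        model.add(layers.Dense(num_labels, activation="softmax"))
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Example: model = build_scc_model("cnn"); model.fit(X_train, y_train, batch_size=32, epochs=10)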

V. RESULTS

Our goal is to determine the best method for applying automated document classification to legal texts, with the hope of facilitating legal experts in their classification of court documents. Our experiments look not only at comparing existing networks, but also at developing our own superior network. As shown in Table I, our CNN model achieves the highest accuracy for both the 15-label and 279-label tasks (72.4% and 31.9% accuracy, respectively). We present an analysis of our results below.

TABLE I: Classification Accuracy Results on the Test Set

    Model                                 15 labels    279 labels
    Word2vec + CNN                        72.4         31.9
    fastText + CNN                        67.3         25.1
    GloVe + CNN                           67.1         17.7
    Word2vec + GRU                        68.6         14.5
    Word2vec + LSTM                       43.8         6.5
    Word2vec + CNN (Kim [20])             65.9         14.7
    Word2vec + CNN (Hughes et al. [7])    54.7         19.8
    LDA + LogR [21]                       40.3         13.4
    Doc2vec + LogR [23]                   54.1         28.6
    BoW + SVM                             64.0         30.5

The accuracy results of correctly classifying documents with our CNN, LSTM, and GRU models compared to other classification methods. Our CNN best overcomes the problem of an uneven distribution of classes, as shown in Figures 1 and 2.

TABLE II: CNN Results on the Development Set by Category

    Label                        Precision    Recall    F-Score    # of Docs
    0 - None                     0.33         0.33      0.33       3
    1 - Criminal Procedure       0.82         0.91      0.86       182
    2 - Civil Rights             0.72         0.70      0.71       138
    3 - First Amendment          0.87         0.79      0.82       84
    4 - Due Process              0.45         0.33      0.38       30
    5 - Privacy                  0.56         0.62      0.59       8
    6 - Attorneys                0.33         0.40      0.36       5
    7 - Unions                   0.73         0.76      0.75       29
    8 - Economic Activity        0.72         0.72      0.72       172
    9 - Judicial Power           0.57         0.56      0.56       116
    10 - Federalism              0.41         0.41      0.41       34
    11 - Interstate Relations    0.67         0.67      0.67       6
    12 - Federal Taxation        0.90         0.79      0.84       33
    13 - Miscellaneous           0.50         0.50      0.50       2
    14 - Private Action          0.00         0.00      0.00       0
    Avg/Total                    0.71         0.72      0.71       842

The relation between frequency and f-measure for the development set.

TABLE III: CNN Results on the Test Set by Category

    Label                        Precision    Recall    F-Score    # of Docs
    0 - None                     0.20         1.00      0.33       1
    1 - Criminal Procedure       0.81         0.85      0.83       183
    2 - Civil Rights             0.77         0.81      0.79       121
    3 - First Amendment          0.79         0.88      0.83       56
    4 - Due Process              0.48         0.30      0.37       33
    5 - Privacy                  0.57         0.44      0.50       9
    6 - Attorneys                0.80         0.73      0.76       11
    7 - Unions                   0.77         0.70      0.73       33
    8 - Economic Activity        0.72         0.74      0.73       145
    9 - Judicial Power           0.56         0.54      0.55       102
    10 - Federalism              0.48         0.33      0.39       33
    11 - Interstate Relations    0.50         0.40      0.44       5
    12 - Federal Taxation        0.86         0.96      0.91       25
    13 - Miscellaneous           0.00         0.00      0.00       1
    14 - Private Action          0.00         0.00      0.00       0
    Avg/Total                    0.72         0.72      0.72       758

The relation between frequency and f-measure for the test set.

Tables II and III show the performance of the CNN (our best system) on individual classes. Although the details are slightly different (e.g., a different number of documents for each category), the relative scores are about the same. We now give a more detailed analysis of the development set results, rather than the test set, because we do not want to examine the test results too closely and bias our future work. It is clear that the model's accuracy tends to be higher for the most frequent categories. Categories 1, 2, 3, 8 and 9, all of which label more than 50 documents, mostly have f-measures of over 70%, with one outlier at 56%. It is difficult to generalize about the least frequent categories (frequency < 10), including labels 0, 5, 6, 11, 13 and 14, as there is too little data to analyze. Some of these have f-measures of 0 or near 0, and on average they do much worse than the high-frequency categories. This is expected, since the high-frequency categories have more training data and thus provide more evidence for the model to build on. Thus, as expected, cases with correct labels of 1, 2, 3, 8 and 9 tend to be classified correctly, and the categories with little to no training data (0, 5, 6, 11, 13, 14) are most often misclassified.

On the other hand, category labels 4, 7, 10, and 12 each have a similar, moderate number of test documents (around 30 documents), but have very different results: the model achieves relatively high results for labels 7 and 12 and relatively poor results for 4 and 10. Thus it would seem that the results for labels 4, 7, 10 and 12 cannot be explained purely on the basis of frequency. Figure 3 is a confusion matrix for our CNN results on the development/validation set. We observe some patterns which may help us understand these results. For labels 7 and 12, where the model performs well, the correct category clearly dominates: no other category is marked for more than 4 documents. However, the poorly performing categories each have a second (or third) dominant category in addition to the correct one. Label 4 (Due Process) is applied correctly to 10 true Due Process cases, while the model incorrectly classifies 7 Due Process cases as Civil Rights, 6 as Economic Activity, 4 as Criminal Procedure, and another 3 under miscellaneous erroneous labels. To the extent that a case may be given multiple classifications (Due Process and Civil Rights, or Due Process and Economic Activity), these errors are understandable and may even reflect a defect in the experiment; perhaps cases should have multiple classifications, and the one-classification-per-case assumption is unrealistic. Similarly, label 10 (Federalism) is applied 14 times correctly to Federalism cases and 11 times incorrectly to Economic Activity cases (and rarely to other categories). It is expected that some federalist issues (issues about the power of the federal government) will overlap with economic issues, so these may also be understandable errors.
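Per-category precision, recall, and F-scores of the kind reported in Tables II and III, along with the confusion matrices in Figures 3 and 4, can be computed with scikit-learn; a minimal sketch, with variable names such as y_dev and dev_predictions assumed:

    from sklearn.metrics import classification_report, confusion_matrix

    # y_dev: true SCDB labels for the development set; dev_predictions: model outputs.
    print(classification_report(y_dev, dev_predictions, digits=2))  # per-class precision/recall/F
    print(confusion_matrix(y_dev, dev_predictions))                 # counts per (true, predicted) pair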

We now attempt to better understand these errors, focusing our error analysis on Federalism classification. We examine three samples from our development set, each consisting of four cases: 4 cases that are correctly classified by our CNN as Federalism cases (True_Fed); 4 cases that were correctly classified as Economic Activity cases (True_Eco); and 4 Federalism cases that our system misclassified as Economic Activity cases (False_Fed). We compare the False_Fed cases to both True_Fed and True_Eco. Our goal is to understand the sort of factors that might cause a human or a machine learning algorithm to mis-classify the False_Fed documents.

The True_Fed cases include Testa et al. v. Katt; Bethlehem Steel Co. et al. v. New York State Labor Relations Board; Rice et al. v. Santa Fe Elevator Corp. et al.; and Rice et al. v. Board of Trade of the City of Chicago. These all involved the interaction of state and federal authorities and laws, including questions of whether a state authority should be compelled to enforce a federal law or whether a state agency/law takes precedence over a federal agency/law.

The 4 True_Eco cases included Halliburton Oil Well Cementing Co. v. Walker et al.; Champlin Refining Co. v. United States et al.; United States v. Howard P. Foley Co., Inc.; and Richfield Oil Corp. v. State Board of Equalization. The Halliburton case is a patent dispute. The Foley case is about the government's liability in a contract dispute. The Champlin case examines whether the Interstate Commerce Commission, a federal entity, could require information from an oil refining company operating across several states. While similar to the True_Fed cases in some ways, there is no conflict between a state and a federal authority. The Richfield case is a dispute about whether a state sales tax applies to a sale to a foreign government. This seems similar to Federalist concerns, but there is no conflict between a state and a federal authority; rather, the concern is whether or not a state sales tax affects a possibly-foreign transaction.

The False_Fed cases include Phillips Chemical Co. v. Dumas Independent School District; Panhandle Eastern Pipe Line Co. v. Michigan Public Service Commission et al.; Wyeth v. Diana Levine; and North Dakota v. United States. These cases all involve financial transactions, state authorities, and federal authorities. The topics covered include: the legitimacy of state taxes on federal land leased to a company; state regulation of alcohol and other goods procured for sale on a federal military base; liability of a drug company (in state civil court) for damages from harm caused by its drug, even though the FDA, a federal agency, granted it clearance for the drug; and whether the sale of natural gas was subject to state regulation, in spite of a federal law licensing the sale. While some of these issues seem to include federal/state authority conflicts, those conflicts are not as clearly articulated as in the True_Fed cases. So it is clear how experienced annotators may be able to consistently distinguish the Federalism and Economic Activity classes; however, we would imagine that inexperienced annotators may have trouble (we would expect lower inter-annotator agreement on these sorts of cases). Similarly, machine learning may require more evidence (more documents) to correctly classify these cases.

Fig. 3: Confusion Matrix for CNN Development Corpus
Fig. 4: Normalized Confusion Matrix for CNN Development Corpus



The sparsity and imbalanced classes of the dataset presented the most challenging obstacle for training the neural networks. For instance, nearly three fourths of all documents fell under 4 of the 15 legal area categories (Criminal Procedure, Civil Rights, Economic Activity, and Judicial Power). Not only was there not an even distribution of documents over the labels, but many of the classes also had little to no training data. Furthermore, our input sequence length was several orders of magnitude larger than the inputs in the experiments conducted by Kim [20] and Hughes et al. [7].

Due to the adjustments we made for the dataset and configuration changes (see Section IV), our CNN performs better than the other CNN models (see Table I). The adjusted parameters include dropouts to account for overfitting that occurs early in training and the addition of extra layers.

After generating a model for each of our NN-based tests, we apply them to our entire corpus and analyze the documents that are misclassified in order to find patterns in the way each of the NN architectures completes the classification task. Figure 5 shows a combination of the normalized confusion matrices resulting from classifying our entire corpus with our CNN and the simple RNN models; each of the true-label rows describes the fraction of documents classified per predicted label. As shown in Figure 5, the CNN performs the best for each of the categories, not just the top four labels (although a small number of errors made by the CNN were with documents from these classes). Unlike our CNN, the classifications from the GRU (our second best system) are more scattered. Figure 5 shows some of the labels the GRU most frequently mis-classifies. In contrast to the CNN and LSTM, the GRU tends to significantly mis-classify documents from both frequent and infrequent categories, as in the frequent category of Criminal Procedure and the uncommon category of First Amendment. The GRU also does not classify entire categories correctly, whereas the CNN had high classification accuracy for every category. For example, the GRU mis-classifies every legal opinion in the categories Interstate Relations, Miscellaneous, and Private Action. Additionally, there was no single label L such that our GRU system correctly classified more documents in class L than our CNN system.

Fig. 5: Merged Confusion Matrix for CNN, GRU, and LSTM

It seems that the relatively low frequency of some of the categories is more of a challenge to some learning algorithms than others. In particular, the CNN and GRU appear to be somewhat more resilient to this effect than the LSTM, as evidenced by the merged confusion matrices shown in Figure 5. The LSTM incorrectly classified a majority of the documents into one of the two most frequent labels (Criminal Procedure and Economic Activity), as shown in Figure 5. Despite categorizing almost all of the documents into only two legal issues, the LSTM did not have a higher accuracy for those two labels. In fact, the LSTM did worse than our LDA baseline system (see Table I). This was somewhat surprising, considering that LSTMs often perform similarly to GRUs for a given task. We plan to investigate this further in future research, possibly trying additional models.

As in Chung et al. [11], we test the performance of LSTMs and GRUs; while we apply these models to categorizing legal text, their application was music transcription. Our results show that the structure of the simpler GRU leads to greater accuracy compared to the LSTM when classifying a relatively small number of documents over a large number of labels. The GRU performs better than the LSTM with document-length sequences. The LSTM tends to remember the wrong information needed for the classification because of the small size of the dataset.

Our results also show that the simple BoW + SVM model performs very well for both classification tasks. As Nallapati and Manning [17] mention, "the SVM assigns high weights to many spurious features owing to their strong correlation with the class" [17, p. 442]. In other words, even very infrequent words that have very high correlations with certain classes would help the SVM associate certain words with uncommon classes. This aspect of the SVM seems to explain why the SVM performed well on the 279-label classification task (nearly as well as the best system), in which only a few documents define each category.

In our results, the word2vec model out-performs the simpler bag-of-words model. With more statistical information, the classifiers find common features and patterns to describe categories with higher accuracy. The positive results from our experiment in which we apply an LR to paragraph vectors (doc2vec) show how well these embeddings capture the meaning of the documents (refer to Table I).

The simple GRU network has promising results compared to the CNNs because the GRU is designed to handle long sequences. While a word may carry a large weight in most contexts, the GRU allows a word's weight to diminish based on specific examples. Yin et al. [19] show that the accuracy of the CNN decreases as sequence length increases and eventually falls under the accuracy of the GRU. Since our experiments involve inputting entire documents instead of sentences, sequence lengths are orders of magnitude larger than those used in the experiments conducted by Yin et al. [19], Kim [20], and Hughes et al. [7].

VI. CONCLUSION AND FUTURE WORK

In this paper, we find that the best method for automated legal document classification is the SCC system that uses a CNN (72.4% accuracy for the 15 general categories and 31.9% accuracy for the 279 more specific categories). On the other hand, the GRU architecture shows promising results compared to our tuned CNN (nearly as high for the 15-category task). We believe that a tuned GRU-based network can potentially complete the task with higher accuracy.

The SCC system uses word embeddings from a general domain (Google News). It is possible that embeddings from the legal domain would improve results. Accordingly, we plan to compile a much larger corpus of US legal opinions from appellate and local courts in order to generate domain-specific word embeddings for our model. We will conduct experiments using these embeddings instead of the Google News embeddings used for the results reported here, or possibly in addition to them. In future work, we plan to investigate the reasons behind the substantial difference between the performance of the LSTM and GRU. Moreover, we believe that an application of transfer learning, as shown in [26], could be used to train classifiers for more specific topics and different subdomains of the legal field.

REFERENCES

[1] J. Wood, "Source-LDA: Enhancing probabilistic topic models using prior knowledge sources," CoRR, vol. abs/1606.00577, 2016. [Online]. Available: http://arxiv.org/abs/1606.00577
[2] H. J. Spaeth, L. Epstein, A. D. Martin, J. A. Segal, T. J. Ruger, and S. C. Benesh, "2017 Supreme Court Database, version 2017 release 01," 2017. [Online]. Available: http://supremecourtdatabase.org
[3] D. R. Cox, "The regression analysis of binary sequences," Journal of the Royal Statistical Society, Series B (Methodological), pp. 215–242, 1958.
[4] O. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, and J. van Genabith, "Exploring the use of text classification in the legal domain," CoRR, vol. abs/1710.09306, 2017. [Online]. Available: http://arxiv.org/abs/1710.09306
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proceedings of the IEEE, 1998, pp. 2278–2324.
[6] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[7] M. Hughes, I. Li, S. Kotoulas, and T. Suzumura, "Medical text classification using convolutional neural networks," CoRR, vol. abs/1704.06841, 2017. [Online]. Available: http://arxiv.org/abs/1704.06841
[8] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, "Very deep convolutional networks for natural language processing," CoRR, vol. abs/1606.01781, 2016. [Online]. Available: http://arxiv.org/abs/1606.01781
[9] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735
[10] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," CoRR, vol. abs/1409.1259, 2014. [Online]. Available: http://arxiv.org/abs/1409.1259
[11] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555
[12] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," CoRR, vol. abs/1803.01271, 2018. [Online]. Available: http://arxiv.org/abs/1803.01271
[13] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, Mar. 2002. [Online]. Available: http://doi.acm.org/10.1145/505282.505283
[14] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," CoRR, vol. abs/1310.4546, 2013. [Online]. Available: http://arxiv.org/abs/1310.4546
[15] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[16] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
[17] R. Nallapati and C. D. Manning, "Legal docket-entry classification: Where machine learning stumbles," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '08. Stroudsburg, PA, USA: Association for Computational Linguistics, 2008, pp. 438–446. [Online]. Available: http://dl.acm.org/citation.cfm?id=1613715.1613771
[18] W.-H. Weng, K. B. Wagholikar, A. T. McCray, P. Szolovits, and H. C. Chueh, "Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach," in BMC Medical Informatics and Decision Making, 2017.
[19] W. Yin, K. Kann, M. Yu, and H. Schütze, "Comparative study of CNN and RNN for natural language processing," CoRR, vol. abs/1702.01923, 2017. [Online]. Available: http://arxiv.org/abs/1702.01923
[20] Y. Kim, "Convolutional neural networks for sentence classification," CoRR, vol. abs/1408.5882, 2014. [Online]. Available: http://arxiv.org/abs/1408.5882
[21] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944937
[22] D. M. Blei, "Probabilistic topic models," Communications of the ACM, vol. 55, no. 4, pp. 77–84, Apr. 2012. [Online]. Available: http://doi.acm.org/10.1145/2133806.2133826
[23] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," CoRR, vol. abs/1405.4053, 2014. [Online]. Available: http://arxiv.org/abs/1405.4053
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[26] C. B. Do and A. Y. Ng, "Transfer learning for text classification," in Proceedings of the 18th International Conference on Neural Information Processing Systems, ser. NIPS '05. Cambridge, MA, USA: MIT Press, 2005, pp. 299–306. [Online]. Available: http://dl.acm.org/citation.cfm?id=2976248.2976286

