
Text Representation: from Vector to Tensor*

Ning Liu1, Benyu Zhang2, Jun Yan3, Zheng Chen2, Wenyin Liu4, Fengshan Bai1, Lee-Feng Chien5

1 Department of Mathematical Science, Tsinghua University, Beijing 100084, P.R. China
{liun01, fbai}@mails.tsinghua.edu.cn
2 Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, P.R. China
{byzhang, zhengc}@microsoft.com
3 School of Mathematical Sciences, Peking University, Beijing 100871, P.R. China
[email protected]
4 Department of Computer Science, City University of Hong Kong, P.R. China
[email protected]
5 Institute of Information Science, Academia Sinica
[email protected]

Abstract

In this paper, we propose a text representation model, the Tensor Space Model (TSM), which models a text document as a multilinear algebraic high-order tensor instead of the traditional vector. Supported by the techniques of multilinear algebra, TSM offers a potent mathematical framework for analyzing multifactor structures. TSM is further supported by particular operations and tools introduced here, such as the High-Order Singular Value Decomposition (HOSVD), for dimension reduction and other applications. Experimental results on the 20 Newsgroups dataset show that TSM is consistently better than VSM for text classification.

1. Introduction and Related Work

Information Retrieval (IR) [2] techniques have attracted much attention during the past decades, since people are frustrated by being drowned in a huge amount of data while still being unable to obtain useful information. The Vector Space Model (VSM) [2] is the cornerstone of many information retrieval techniques; it is used to represent text documents and to define the similarity among them.

Bag of Words (BOW) [2] is the earliest approach to representing a document, as a bag of words under the VSM. In the BOW representation, a document is encoded as a feature vector, with each element in the vector indicating the presence or absence of a word in the document by TFIDF indexing [5]. However, the major limitation of BOW is that it only retains the frequency of the words in the document and loses the sequence information.

In the past decade, attempts have been made to incorporate word-order knowledge into the vector space representation. The N-gram statistical language model [3, 4] is a well-known one among them. The entries of a document vector under the N-gram representation are strings of n consecutive words extracted from the collection. N-grams are effective approximations: they not only keep the word-order information but also solve the language-independence problem. However, their high-dimensional feature vectors make many powerful information retrieval technologies, such as Latent Semantic Indexing (LSI) [2] and Principal Component Analysis (PCA) [6], infeasible for large datasets.

During the past few years, IR researchers have proposed a variety of effective representation approaches for text documents based on VSM. However, since the volume of available text data is increasing very fast nowadays, more and more researchers ask [1]:

"Are the further improvements likely to require a broad range of techniques in addition to IR area?"

This motivates us to seek a new model for text document representation based on new techniques. The requirements for the new model are to grasp the context of each word, to be language independent, and to scale to large datasets. In this paper, we propose a

* This work was done while the first author was an intern at Microsoft Research Asia.

novel Tensor Space Model (TSM) for text document representation. The proposed TSM is based on character-level high-order tensors (the natural generalization of matrices) and offers a potent mathematical framework for analyzing multifactor structure [9]. In contrast to VSM, TSM represents a text document by high-order tensors instead of vectors (1-order tensors) or matrices (2-order tensors). The features of each coordinate are the letters "a" to "z", and all other non-alphabetic symbols, such as punctuation marks, are denoted by "_". Moreover, a dimensionality reduction algorithm is proposed based on a tensor extension of the conventional matrix Singular Value Decomposition (SVD), known as the High-Order SVD (HOSVD) [8]. The HOSVD technique can find underlying latent structure in documents and makes algorithms such as LSI and PCA easy to implement under TSM. Furthermore, theoretical analysis and experiments show that HOSVD under TSM can significantly outperform VSM on classification problems with small training data. Another contribution of TSM is that it can draw on many multilinear algebra techniques to increase the performance of IR.

The rest of this paper is organized as follows. In Section 2, we focus on the multilinear model. The HOSVD algorithm, which is used for computing the underlying space, is presented in Section 3. The experimental results on the 20 Newsgroups dataset [7] are given in Section 4. Conclusion and future work are presented in Section 5.

2. Tensor Space Model

"Tensor" is a term from multilinear algebra. It is a generalization of the concepts of "vector" and "matrix" from linear algebra. Intuitively, a vector data structure is called a 1-order tensor and a matrix data structure is called a 2-order tensor; a cube-like data structure is then called a 3-order tensor, and so on. In other words, higher-order tensors are abstract data structures that generalize vectors and matrices. TSM uses the tensor structure to describe text documents and uses the techniques of multilinear algebra to increase the performance of IR.

To start with, we introduce the following notation. In this paper, scalars are denoted by lower-case letters $a, b, \ldots$; vectors by normal-font capital letters $A, B, \ldots$; matrices by bold capital letters $\mathbf{A}, \mathbf{B}, \ldots$; and higher-order tensors by calligraphic capital letters $\mathcal{A}, \mathcal{B}, \ldots$. We define the order of a tensor to be $N$ if $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. The entries of $\mathcal{A}$ are denoted by $\mathcal{A}_{i_1 \cdots i_n \cdots i_N}$ or $a_{i_1 \cdots i_n \cdots i_N}$, where $1 \le i_n \le I_n$ for $1 \le n \le N$.

The traditional BOW cannot capture and utilize the valuable word order information. Conversely, although the N-gram representation of documents can encode this word sequence information, the high-dimensional vectors it generates lead to very high storage and computational complexity. This high complexity defeats many powerful tools, such as LSI and PCA, in the text mining and information retrieval process. Hence, we propose to use higher-order tensors to represent text documents so that both the word order information and the complexity problem are addressed. Moreover, we will show that the proposed TSM offers many other advantages compared with the popular BOW and N-gram models.

The TSM is a model of text document representation. We start from a simple example. Consider the simple document below, which consists of six words:

"Text representation: from vector to tensor"

We use a 3-order tensor $\mathcal{A} = \{a_{i_1 i_2 i_3}\} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ to represent this document and index it by the 26 English letters. All other characters, such as punctuation marks and spaces, are treated as the same and denoted by "_". The character string of this document can then be separated into character triples:

"tex, ext, xt_, t_r, _re, rep, ..."

The 26 letters "a" to "z" and "_" scale each axis of the tensor space, so the document is represented by a 27 × 27 × 27 tensor. The "_" character corresponds to position 0 of each axis, and "a" to "z" correspond to positions 1 to 26. For example, the position of "tex" is (20, 5, 24), since "t" is the 20th character among the 26 English letters, "e" is the 5th, and "x" is the 24th. As another example, "xt_" corresponds to (24, 20, 0). We then use the TFIDF method to weight each position of the tensor, in the same way as for VSM. By doing so, each document is represented by a character-level 3-order tensor, as shown in Figure 1.

If we put a corpus of documents together, we obtain a 4-order tensor in a 27 × 27 × 27 × m space, where m is the number of documents, as illustrated in Figure 2. Figure 2 only shows a 4-order TSM; in our model, the order of the tensor for each document is not limited to 3, so the order of the tensor for a corpus of documents is not limited to 4. Without loss of generality, a corpus with m documents can be represented as a character-level tensor $\mathcal{A} = \{a_{i_1 i_2 \cdots i_N}\} \in \mathbb{R}^{27 \times 27 \times \cdots \times 27 \times m}$, where each document is represented by an (N-1)-order tensor.
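For concreteness, the construction above can be sketched in a few lines of Python/NumPy. This is our illustration rather than the authors' code: the helper names (`char_index`, `char_tensor`) are our own, raw trigram counts are shown, and the TFIDF weighting the paper applies on top of these counts is omitted.

```python
import numpy as np

def char_index(c: str) -> int:
    """Map 'a'..'z' to 1..26; every other character to 0 ('_')."""
    return ord(c) - ord('a') + 1 if 'a' <= c <= 'z' else 0

def char_tensor(doc: str, order: int = 3) -> np.ndarray:
    """Build the character-level `order`-order count tensor of a document.

    Each axis has 27 positions (0 for '_', 1..26 for 'a'..'z'); entry
    (i1, ..., iN) counts how often that character N-gram occurs.
    """
    t = np.zeros((27,) * order)
    chars = [char_index(c) for c in doc.lower()]
    for i in range(len(chars) - order + 1):
        t[tuple(chars[i:i + order])] += 1   # one count per sliding window
    return t

doc = "Text representation: from vector to tensor"
t = char_tensor(doc)
print(t.shape)        # (27, 27, 27)
print(t[20, 5, 24])   # count of "tex" -> 1.0
# A corpus becomes a 4-order tensor by stacking the per-document tensors:
# D = np.stack([t1, t2, ...], axis=-1)
```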

Figure 1. A document is represented as a character-level 3-order tensor.

Figure 2. A corpus of documents is represented as a 4-order tensor.

3. HOSVD Algorithm

VSM represents a group of objects as a "term by object" matrix and uses the SVD technique to decompose the matrix as $D = U_1 \Sigma U_2^T$, which is the essential technique behind PCA and LSI. Similarly, a tensor $\mathcal{D}$ in TSM undergoes the Higher-Order SVD, which is an extension of the matrix SVD. This process is illustrated in Figure 3 for the case N = 3.

Figure 3. Illustration of a Higher-Order SVD (N = 3).

The HOSVD algorithm based on TSM is described as follows:

Step 1. Represent a group of documents as a character-level tensor $\mathcal{D} = \{d_{i_1 i_2 \cdots i_{N+1}}\} \in \mathbb{R}^{27 \times 27 \times \cdots \times 27 \times m}$, where m is the number of documents.

Step 2. For $n = 1, \ldots, N$, compute the matrix $U_n$ by performing the SVD of the flattened matrix $D_{(n)}$, where $U_n$ is the left matrix in the SVD result.

Step 3. Compute the core tensor as $\mathcal{Z} = \mathcal{D} \times_1 U_1^T \times_2 U_2^T \cdots \times_N U_N^T$, where the matrix $U_n$ contains the orthogonal vectors spanning the column space of $D_{(n)}$. Here $D_{(n)}$ is the matrix unfolding of the tensor, i.e., the matrix representation of the tensor in which all the mode-n column vectors are ranked sequentially.
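The three steps can be sketched in Python/NumPy as follows. This is a minimal illustration under our own conventions, not the authors' implementation: `unfold`, `fold`, and `mode_dot` are hypothetical helper names, and the optional `ranks` argument truncates each $U_n$ to realize the dimensionality reduction used in Section 4.

```python
import numpy as np

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding D_(n): the mode-n fibers become the columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix: np.ndarray, mode: int, shape) -> np.ndarray:
    """Inverse of unfold, for a tensor of the given target shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape([shape[mode]] + rest), 0, mode)

def mode_dot(tensor: np.ndarray, matrix: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n product: multiply `matrix` into the n-th mode of `tensor`."""
    shape = list(tensor.shape)
    shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, shape)

def hosvd(D: np.ndarray, ranks=None):
    """Steps 2-3: SVD each character-mode unfolding (the last, document
    mode is left untouched, matching n = 1..N in Step 2), then form the
    core tensor Z = D x1 U1^T x2 U2^T ... xN UN^T."""
    Us = []
    for n in range(D.ndim - 1):                          # Step 2
        U, _, _ = np.linalg.svd(unfold(D, n), full_matrices=False)
        if ranks is not None:
            U = U[:, :ranks[n]]                          # truncate for reduction
        Us.append(U)
    Z = D
    for n, U in enumerate(Us):                           # Step 3
        Z = mode_dot(Z, U.T, n)
    return Z, Us

# Example: reduce a small random 27 x 27 x 27 x m corpus tensor to 12^3 x m.
D = np.random.rand(27, 27, 27, 8)
Z, Us = hosvd(D, ranks=[12, 12, 12])
print(Z.shape)   # (12, 12, 12, 8)
```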

4. Experiments

4.1. Experiment Setup

We have conducted experiments to compare the proposed TSM with VSM on the 20 Newsgroups dataset, which has become a popular dataset for experiments in text applications of machine learning techniques. In this paper, we select the five classes of the 20 Newsgroups collection about computer science, which are very closely related to each other.

The widely used performance measurements for text categorization problems are Precision, Recall and Micro F1 [2]. Precision is the ratio of correctly categorized data over all data assigned to a category; Recall is the ratio of correctly categorized data over all testing data belonging to that category. Micro F1 is a common measure in text categorization that combines recall and precision into a single score according to the following formula:

$$\text{Micro F1} = \frac{2 P \times R}{P + R}$$

where P is the Precision and R is the Recall [2].
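As a small illustration (ours, not the paper's), micro-averaging pools the per-class decision counts before computing P, R, and F1; the function and variable names below are our own.

```python
def micro_f1(tp: list[int], fp: list[int], fn: list[int]) -> float:
    """Micro-averaged F1: pool true positives, false positives, and false
    negatives across all classes, then apply F1 = 2PR / (P + R)."""
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    p = TP / (TP + FP)   # precision: correct / all assigned
    r = TP / (TP + FN)   # recall: correct / all that should be assigned
    return 2 * p * r / (p + r)

# Example: three classes with per-class (tp, fp, fn) counts.
print(micro_f1(tp=[50, 30, 20], fp=[5, 10, 5], fn=[10, 5, 5]))
```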

4.2. Experiment Results

Each text document of this dataset is mapped into a 131,072-dimension vector under VSM. We use a 4-order tensor to represent each document under TSM; the original dimension of each tensor is 27^4 = 531,441. (Here and in the figures, x^y denotes x raised to the power y.) We use the HOSVD technique to reduce the tensor to different dimensions (26^4, 20^4 and 10^4). In Figure 4, we report the results of VSM in contrast to TSM at different reduced dimensions.

It can be seen that the result of the 4-order tensor with the original dimension (27^4) is better than VSM, and the result of the 4-order tensor whose dimension is reduced to 26^4 by HOSVD is the best of all. This shows that HOSVD can find the principal components and remove the noise. Although the performance of the 10^4-dimension reduced tensor is much lower than that of VSM and of the original 4-order TSM, the result is acceptable, and this low-dimensional representation makes the computation more efficient.

Figure 4. Text categorization with VSM and TSM on 20 Newsgroups (Micro F1 for VSM(131,072), TSM(27^4), TSM(26^4), TSM(20^4) and TSM(10^4)).

We do not reduce the dimension of the 20 Newsgroups data by SVD under VSM, since decomposing the huge term-by-document matrix of 20 Newsgroups is hard due to its high time and space complexity. To compare VSM with SVD against TSM with HOSVD, we randomly sampled a subset of the 20 Newsgroups at a ratio of about 5%, such that the data dimension is about 8,000 under VSM re-indexing. The subset contains 230 documents in two classes, 165 for training and 65 for testing. By doing so, we can perform the matrix SVD on this sampled data. Figure 5 shows the results.

Figure 5. Text categorization on a subset of 20 Newsgroups (Micro F1 for VSM(8,192), VSM(125), VSM(64), TSM(27^3), TSM(12^3), TSM(5^3) and TSM(4^3)).

It can be seen that if we reduce the data under VSM and the data under TSM to the same dimension (125 versus 5^3, 64 versus 4^3, etc.), the reduced TSM data always outperform their VSM counterparts. Moreover, the dimension of the reduced data under VSM cannot be larger than 165, since the number of documents (smaller than the term dimension) determines the rank of the "term by document" matrix, and this rank allows at most 165 singular vectors. On the contrary, HOSVD under TSM can reduce the data to any dimension and is not limited by the number of samples. Figure 5 shows that the 12 × 12 × 12 reduced tensors achieve better performance than all the others, while SVD under VSM cannot reduce the data to such a dimension.

5. Conclusion and Future Work

In this paper, we propose to use the Tensor Space Model to represent text documents. By using TSM and HOSVD, underlying latent structure in documents can be found. Theoretical analysis and experimental results show that the proposed TSM keeps the merits of VSM while improving on several of its disadvantages for certain IR problems.

The TSM proposed in this paper implies many new tasks to be done. For instance, the design of tensor kernels for better similarity measurement, testing the performance of TSM on non-language datasets, customizing and applying techniques originally designed for the traditional VSM to TSM, and investigating and applying more multilinear algebra theorems to increase the performance of IR under TSM are all on our future work agenda.

6. References

[1] Aslam, J., Belkin, N., Zhai, C., Callan, J., Hiemstra, D., Hofmann, T., Dumais, S., Harper, D.J., et al. Challenges in Information Retrieval and Language Modeling: Report of a Workshop Held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, 2001.
[2] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley, 1999.
[3] Cavnar, W.B. and Trenkle, J.M. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994, 161-169.
[4] Croft, W.B. and Lafferty, J. Language Modeling for Information Retrieval. Kluwer Academic, 2003.
[5] Gerard, S. and Chris, B. Term Weighting Approaches in Automatic Text Retrieval. Technical Report TR87-881, Department of Computer Science, Cornell University, 1987.
[6] Jolliffe, I.T. Principal Component Analysis. Springer-Verlag, New York, 1986.
[7] Lang, K. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), 331-339.
[8] Lathauwer, L.D., Moor, B.D. and Vandewalle, J. A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications, 21, 1253-1278.
[9] Wrede, R.C. Introduction to Vector and Tensor Analysis. Wiley, 1963.

