Software for converting a text document into a bag-of-words (BOW) vector
Contact: Sue Ann Hong, sahong@cs.cmu.edu

There are two files in this package:

1. ConvertText2Bow.jar

=======================================
Instructions from the Author (Sophie Wang, sophie.wang@cs.cmu.edu)

To use it, please type
java -cp ConvertText2Bow.jar converttext2bow.Main input_text_file (BOW_vector_file)
You must give the input text file name. The output file name (shown in
parentheses) is optional; by default it is "bow.txt".
=======================================

I believe the program also writes a vocabulary file, "vocabulary.txt",
but I am not sure where.

You need Java installed on your machine. It is easiest to run the
program in a Linux or Mac OS X terminal, or in Cygwin on Windows.


2. convertall.py

I have also attached convertall.py, a Python script that converts
directories of documents into directories of BOWs. You can then read
these BOWs into Matlab, assigning a different label to each directory
as you do so. To use the script, modify the directory names in it, and
then run (in a Linux terminal, Cygwin, etc., with Python installed):

$ python convertall.py

You might also have to change the first line of the script,
"#! /usr/bin/env python", depending on your machine configuration.
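
For orientation, here is a rough sketch of the kind of loop a script like
convertall.py could run. It is not the attached script. The function names
and the ".bow.txt" output suffix are made up for illustration, and it assumes
ConvertText2Bow.jar sits in the current directory and java is on your PATH.

```python
import subprocess
from pathlib import Path

JAR = "ConvertText2Bow.jar"  # assumption: the jar is in the current directory


def build_command(text_file, bow_file, jar=JAR):
    """Return the java command from the author's instructions as an argument list."""
    return ["java", "-cp", str(jar), "converttext2bow.Main",
            str(text_file), str(bow_file)]


def plan_jobs(input_dir, output_dir):
    """Pair every regular file in input_dir with an output path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for text_file in sorted(input_dir.iterdir()):
        if text_file.is_file():
            # Hypothetical naming scheme: doc1.txt -> doc1.bow.txt
            jobs.append((text_file, output_dir / (text_file.stem + ".bow.txt")))
    return jobs


def convert_directory(input_dir, output_dir, jar=JAR):
    """Run the converter once per document; needs java installed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for text_file, bow_file in plan_jobs(input_dir, output_dir):
        subprocess.run(build_command(text_file, bow_file, jar), check=True)
```

You would call convert_directory("some_input_dir", "some_output_dir") once
per class directory. Both directory names here are placeholders.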


Hope this helps, but if it seems too complicated, you can always write
it yourself in your favorite language (even Matlab, though it might be slow).
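
If you do write it yourself, a minimal version could look something like the
sketch below. It is not what ConvertText2Bow.jar does internally. The
tokenization rule (lowercase, split on anything that is not a letter or
digit) and the function names are my own assumptions, and the output is a
plain Python list rather than the jar's "bow.txt" format.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token across all documents to a column index (sorted order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def to_bow(text, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    # Dicts keep insertion order, so this follows the vocabulary's index order.
    return [counts[tok] for tok in vocabulary]
```

Build the vocabulary once over all of your documents, then call to_bow on
each document so that every vector has the same length and column order.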
