SAS Visual Text
Chapter 1
Shared Concepts
Contents
Introduction to Shared Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Using CAS Sessions and CAS Engine Librefs . . . . . . . . . . . . . . . . . . . . . 1
Loading a SAS Data Set onto a CAS Server . . . . . . . . . . . . . . . . . . . . . . . 2
Details for SAS Visual Analytics Procedures . . . . . . . . . . . . . . . . . . . . . . . . . 3
Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
cas mysess;
libname mycas cas sessref=mysess;
The CAS statement creates the CAS session named mysess, and the LIBNAME statement creates the
mycas CAS engine libref that you use to connect to this session. It is not necessary to explicitly name the
CASHOST and CASPORT of the CAS server in the CAS statement, because these values are retrieved from
the corresponding SAS option values.
If you have created the mysess session, you can terminate it by using the TERMINATE option in the CAS statement as follows:

cas mysess terminate;
You can use a single DATA step to create a data table on the CAS server as follows:
data mycas.Sample;
input y x @@;
datalines;
.46 1 .47 2 .57 3 .61 4 .62 5 .68 6 .69 7
;
Note that DATA step operations might not work as intended when you perform them on the CAS server
instead of the SAS client.
You can create a SAS data set first, and when it contains exactly what you want, you can use another
DATA step to load it onto the CAS server as follows:
data Sample;
input y x @@;
datalines;
.46 1 .47 2 .57 3 .61 4 .62 5 .68 6 .69 7 .78 8
;
data mycas.Sample;
set Sample;
run;
The CASUTIL procedure can load data onto a CAS server more efficiently than the DATA step.
For more information about the CASUTIL procedure, see SAS Cloud Analytic Services: Language
Reference.
The mycas caslib stores the Sample data table, which can be distributed across many machine nodes. You
must use a caslib reference in procedures in this book to enable the SAS client machine to communicate with
the CAS session. For example, the following TEXTMINE procedure statements use a data table that resides
in the mycas caslib:
Multithreading
Threading refers to the organization of computational work into multiple tasks (processing units that can
be scheduled by the operating system). A task is associated with a thread. Multithreading refers to the
concurrent execution of threads. When multithreading is possible, substantial performance gains can be
realized compared to sequential (single-threaded) execution. The number of threads spawned by a procedure
in this book is determined by your installation.
The tasks that are multithreaded by procedures in this book are primarily defined by dividing the data that
are processed on a single machine among the threads—that is, the procedures implement multithreading
through a data-parallel model. For example, if the input data table has 1,000 observations and the procedure
is running on four threads, then 250 observations are associated with each thread. All operations that require
access to the data are then multithreaded. These operations include the following (not all operations are
required for all procedures):
variable levelization
effect levelization
In addition, operations on matrices such as sweeps can be multithreaded provided that the matrices are
of sufficient size to realize performance benefits from managing multiple threads for the particular matrix
operation.
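The data-parallel division described above (for example, 1,000 observations across four threads) can be sketched as follows. This is only an illustration of the even, contiguous split; the actual distribution of observations to threads is determined by your installation:

```python
def partition_sizes(n_obs, n_threads):
    """Divide n_obs observations into near-equal blocks, one per thread."""
    base, extra = divmod(n_obs, n_threads)
    # The first `extra` threads each take one leftover observation.
    return [base + (1 if t < extra else 0) for t in range(n_threads)]

print(partition_sizes(1000, 4))  # [250, 250, 250, 250]
```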
Chapter 2
The BOOLRULE Procedure
Contents
Overview: BOOLRULE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
PROC BOOLRULE Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Using CAS Sessions and CAS Engine Librefs . . . . . . . . . . . . . . . . . . . . . 7
Getting Started: BOOLRULE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Syntax: BOOLRULE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
PROC BOOLRULE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
DOCINFO Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
OUTPUT Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
SCORE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
TERMINFO Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Details: BOOLRULE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
BOOLLEAR for Boolean Rule Extraction . . . . . . . . . . . . . . . . . . . . . . . 17
Term Ensemble Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Rule Ensemble Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Measurements Used in BOOLLEAR . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Precision, Recall, and the F1 Score . . . . . . . . . . . . . . . . . . . . . . . 20
g-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Estimated Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Improvability Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Shrinking the Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Significance Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
k-Best Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Improvability Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Early Stop Based on the F1 Score . . . . . . . . . . . . . . . . . . . . . . . 23
Output Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
CANDIDATETERMS= Data Table . . . . . . . . . . . . . . . . . . . . . . 23
RULES= Data Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
RULETERMS= Data Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Scoring Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
OUTMATCH= Data Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Examples: BOOLRULE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Example 2.1: Rule Extraction for Binary Targets . . . . . . . . . . . . . . . . . . . . 26
Example 2.2: Rule Extraction for a Multiclass Target . . . . . . . . . . . . . . . . . . 28
Example 2.3: Using Events in Rule Extraction . . . . . . . . . . . . . . . . . . . . . 30
Example 2.4: Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
In this example, ^ indicates a logical “and,” and ~ indicates a logical negation. The first line of the rule set
says that if a document contains the terms “cut,” “rate,” “bank,” and “percent,” but does not contain the term
“sell,” it belongs to the bank interest category.
The BOOLRULE procedure has three advantages when you use a supervised rule-based model to analyze
your large-scale transactional data. First, it focuses on modeling the positive documents in a category.
Therefore, it is more robust when the data are imbalanced.1 Second, the rules can be easily interpreted and
modified by a human expert, enabling better human-machine interaction. Third, the procedure adopts a set of
effective heuristics to significantly shrink the search space of the rule search, and its basic operations are set
operations, which can be implemented very efficiently. Therefore, the procedure is highly efficient and can
handle very large-scale problems.
1A data table is imbalanced if it contains many more negative samples than positive samples, or vice versa.
Using CAS Sessions and CAS Engine Librefs
cas mysess;
libname mycas cas sessref=mysess;
The CAS statement creates the CAS session named mysess, and the LIBNAME statement creates the
mycas CAS engine libref that you use to connect to this session. It is not necessary to explicitly name the
CASHOST and CASPORT of the CAS server in the CAS statement, because these values are retrieved from
the corresponding SAS option values.
If you have created the mysess session, you can terminate it by using the TERMINATE option in the CAS statement as follows:

cas mysess terminate;
data mycas.getstart;
infile datalines delimiter='|' missover;
length text $150;
input text$ apple_fruit did$;
datalines;
Delicious and crunchy apple is one of the popular fruits | 1 |d01
Apple was the king of all fruits. | 1 |d02
Custard apple or Sitaphal is a sweet pulpy fruit | 1 |d03
apples are a common tree throughout the tropics | 1 |d04
apple is round in shape, and tasts sweet | 1 |d05
Tropical apple trees produce sweet apple| 1| d06
Fans of sweet apple adore Fuji because it is the sweetest of| 1 |d07
this apple tree is small | 1 |d08
Apple Store shop iPhone x and iPhone x Plus.| 0 |d09
See a list of Apple phone numbers around the world.| 0 |d10
Find links to user guides and contact Apple Support, | 0 |d11
Apple counters Samsung Galaxy launch with iPhone gallery | 0 |d12
Apple Smartphones - Verizon Wireless.| 0 |d13
Apple mercurial chief executive, was furious.| 0 |d14
Apple has upgraded the phone.| 0 |d15
the great features of the new Apple iPhone x.| 0 |d16
Apple sweet apple iphone.| 0 |d17
Apple apple will make cars | 0 |d18
Apple apple also makes watches| 0 |d19
Apple apple makes computers too| 0 |d20
;
run;
These statements assume that your CAS engine libref is named mycas, but you can substitute any appropriately
defined CAS engine libref.
The following statements use the TEXTMINE procedure to parse the input text data. The generated term-by-
document matrix is stored in a data table named mycas.bow. The summary information about the terms in
the document collection is stored in a data table named mycas.terms.
proc boolrule
data = mycas.bow
docid = _document_
termid = _termnum_
docinfo = mycas.getstart
terminfo = mycas.terms
minsupports = 1
mpos = 1
gpos = 1;
docinfo
id = did
targets = (apple_fruit);
terminfo
id = key
label = term;
output
rules = mycas.rules
ruleterms = mycas.ruleterms;
run;
The mycas.bow and mycas.terms data sets are specified as input in the DATA= and TERMINFO= options,
respectively, in the PROC BOOLRULE statement. In addition, the DOCID= and TERMID= options in the
PROC BOOLRULE statement specify the columns of the mycas.bow data table that contain the document
ID and term ID, respectively.
The DOCINFO statement specifies the following information about the mycas.GetStart data table:
The ID= option specifies the column that contains the document ID. The variables in this column are
matched to the document ID variable that is specified in the DOCID= option in the PROC BOOLRULE
statement in order to fetch target information about documents for rule extraction.
The TERMINFO statement specifies the following information about the mycas.terms data table:
The ID= option specifies the column that contains the term ID. The variables in this column are
matched to the term ID variable that is specified in the TERMID= option in the PROC BOOLRULE
statement in order to fetch information about terms for rule extraction.
The LABEL= option specifies the column that contains the text of the terms.
The OUTPUT statement requests that the extracted rules be stored in the data table mycas.Rules.
Figure 2.1 shows the SAS log that PROC BOOLRULE generates; the log provides information about the
default configurations used by the procedure, about where the procedure runs, and about the input and
output files. The log shows that the mycas.rules data table contains two observations, indicating that the
BOOLRULE procedure identified two rules for the apple_fruit category.
The following statements use PROC PRINT to show the contents of the mycas.rules data table that the
BOOLRULE procedure generates:

proc print data=mycas.rules;
run;

The following statements run the BOOLRULE procedure to match rules in documents and run PROC PRINT
to show the results:
proc boolrule
data = mycas.bow
docid = _document_
termid = _termnum_;
score
ruleterms = mycas.ruleterms
outmatch = mycas.matches;
run;
proc print data=mycas.matches;
run;
Figure 2.3 shows the output of PROC PRINT, the mycas.matches data table. For information about the
output of the OUTMATCH= option, see the section “OUTMATCH= Data Table” on page 25.
Syntax: BOOLRULE Procedure
The PROC BOOLRULE statement invokes the procedure. Table 2.1 summarizes the options in the statement
by function. The options are then described fully in alphabetical order.
Option Description

Basic Options
DATA= Specifies the input data table (which must be in transactional format) for rule extraction
DOCID= Specifies the variable in the DATA= data table that contains the document ID
DOCINFO= Specifies the input data table that contains information about documents
GNEG= Specifies the minimum g-score needed for a negative term to be considered for rule extraction
GPOS= Specifies the minimum g-score needed for a positive term or a rule to be considered for rule extraction
MAXCANDIDATES= Specifies the number of term candidates to be selected for each category
MAXTRIESIN= Specifies the kin value for the k-best search in the term ensemble process for creating a rule
MAXTRIESOUT= Specifies the kout value for the k-best search in the rule ensemble process for creating a rule set
MINSUPPORTS= Specifies the minimum number of documents in which a term needs to appear in order for the term to be used for creating a rule
MNEG= Specifies the m value for computing estimated precision for negative terms
MPOS= Specifies the m value for computing estimated precision for positive terms
TERMID= Specifies the variable in the DATA= data table that contains the term ID
TERMINFO= Specifies the input data table that contains information about terms
DATA=CAS-libref.data-table
DOC=CAS-libref.data-table
names the input data table for PROC BOOLRULE to use. CAS-libref.data-table is a two-level name,
where
CAS-libref refers to a collection of information that is defined in the LIBNAME statement and
includes the caslib, which includes a path to the data, and a session identifier, which
defaults to the active session but which can be explicitly defined in the LIBNAME
statement. For more information about CAS-libref , see the section “Using CAS
Sessions and CAS Engine Librefs” on page 7.
data-table specifies the name of the input data table.
Each row of the input data table must contain one variable for the document ID and one variable for the
term ID. Both the document ID variable and the term ID variable can be either a numeric or character
variable. The BOOLRULE procedure does not assume that the data table is sorted by either document
ID or term ID.
DOCID=variable
specifies the variable that contains the ID of each document. The document ID can be either a number
or a string of characters.
DOCINFO=CAS-libref.data-table
names the input data table that contains information about documents. CAS-libref.data-table is a
two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies
the name of the input data table. For more information about this two-level name, see the DATA=
option and the section “Using CAS Sessions and CAS Engine Librefs” on page 7.
Each row of the input data table must contain one variable for the document ID. The BOOLRULE
procedure uses the document ID in the DATA= data table to search for the document ID variable in
this data table to obtain information about documents (for example, the categories of each document).
GNEG=g-value
specifies the minimum g-score needed for a negative term to be considered for rule extraction in the
term ensemble. If you do not specify this option, the value that is specified for the GPOS= option (or
its default value) is used. For more information about g-score, see the section “g-Score” on page 21.
GPOS=g-value
specifies the minimum g-score needed for a positive term to be considered for rule extraction in the
term ensemble. A rule also needs to have a g-score that is higher than g-value to be considered in the
rule ensemble. The g-value is also used in the improvability test. A rule is improvable if the g-score
that is computed according to the improvability test is larger than g-value. By default, GPOS=8.
MAXCANDIDATES=n
MAXCANDS=n
specifies the number of term candidates to be selected for each category. Rules are built by using only
these term candidates. By default, MAXCANDS=500.
MAXTRIESIN=n
specifies the kin value for the k-best search in the term ensemble process for creating rules. For more
information, see the section “k-Best Search” on page 23. By default, MAXTRIESIN=150.
MAXTRIESOUT=n
specifies the kout value for the k-best search in the rule ensemble process for creating a rule set. For
more information, see the section “k-Best Search” on page 23. By default, MAXTRIESOUT=50.
MINSUPPORTS=n
specifies the minimum number of documents in which a term needs to appear in order for the term to
be used for creating a rule. By default, MINSUPPORTS=3.
MNEG=m
specifies the m value for computing estimated precision for negative terms. If you do not specify this
option, the value specified for the MPOS= option (or its default value) is used.
MPOS=m
specifies the m value for computing estimated precision for positive terms. By default, MPOS=8.
TERMID=variable
specifies the variable that contains the ID of each term. The variable can be either a number or a string
of characters. If the TERMINFO= option is not specified, variable is also used as the label of terms.
TERMINFO=CAS-libref.data-table
names the input data table that contains information about terms. CAS-libref.data-table is a two-level
name, where CAS-libref refers to the caslib and session identifier, and data-table specifies the name of
the input data table. For more information about this two-level name, see the DATA= option and the
section “Using CAS Sessions and CAS Engine Librefs” on page 7.
Each row of the input data table must contain one variable for the term ID. If you specify this option,
you must use the TERMINFO statement to specify which variables in the data table contain the term
ID and the term label, respectively. The BOOLRULE procedure uses the term ID in the DATA= data
table to search for the term ID variable in this data table to obtain information about the terms. If you
do not specify this option, the content of the TERMID= variable is also used as the label of terms.
DOCINFO Statement
DOCINFO < options > ;
The DOCINFO statement specifies information about the data table that is specified in the DOCINFO=
option in the PROC BOOLRULE statement.
You can specify the following options:
EVENTS=(value1, value2, ...)
specifies the values of target variables that are considered as positive events or categories of interest as
follows:
When TARGETTYPE=BINARY, the values that you specify in this option correspond to positive
events of each target variable that is specified in the TARGET= option. All other values correspond
to negative events.
When TARGETTYPE=BINARY, for any variable specified in the TARGET= option that is a
numeric variable, "1" is considered to be a positive event by default.
When TARGETTYPE=BINARY, for any variable specified in the TARGET= option that is a
character variable, "Y" is considered to be a positive event by default.
You cannot specify this option when TARGETTYPE=MULTICLASS.
ID=variable
specifies the variable that contains the document ID. To fetch the target information about documents,
the values in the variable are matched to the document ID variable that is specified in the DOCID=
option in the PROC BOOLRULE statement. The variable can be either a numeric variable or a
character variable. Its type must match the type of the variable that is specified in the DOCID= option
in the PROC BOOLRULE statement.
TARGET=(variable, variable, ...)
specifies the target variables. A target variable can be either a numeric variable or a character variable.
When TARGETTYPE=BINARY, you can specify multiple target variables, and each target
variable corresponds to a category.
When TARGETTYPE=MULTICLASS, you can specify only one target variable, and each of its
levels corresponds to a category.
TARGETTYPE=BINARY | MULTICLASS
specifies the type of the target variables. You can specify the following values:
BINARY indicates that multiple target variables can be specified and each target variable
corresponds to a category.
MULTICLASS indicates that only one target variable can be specified and each level of the target
variable corresponds to a category.
By default, TARGETTYPE=BINARY.
OUTPUT Statement
OUTPUT < options > ;
The OUTPUT statement specifies the data tables that contain the results that the BOOLRULE procedure
generates.
You can specify the following options:
CANDIDATETERMS=CAS-libref.data-table
specifies a data table to contain the terms that have been selected by the BOOLRULE procedure for
rule creation. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and
session identifier, and data-table specifies the name of the output data table. For more information
about this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS
Engine Librefs” on page 7.
If MAXCANDIDATES=p in the BOOLRULE statement, the procedure selects at most p terms for
each category to be considered for rule extraction. For more information about this data table, see the
section “Output Data Sets” on page 23.
RULES=CAS-libref.data-table
specifies a data table to contain the rules that have been generated by the BOOLRULE procedure for
each category. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and
session identifier, and data-table specifies the name of the output data table. For more information
about this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS
Engine Librefs” on page 7.
For more information about this data table, see the section “Output Data Sets” on page 23.
RULETERMS=CAS-libref.data-table
specifies a data table to contain the terms in each rule that is generated by the BOOLRULE procedure.
CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier,
and data-table specifies the name of the output data table. For more information about this two-level
name, see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on
page 7.
For more information about this data table, see the section “Output Data Sets” on page 23.
SCORE Statement
SCORE < options > ;
The SCORE statement specifies the input data table that contains the terms in rules and the output data table
to contain the scoring results.
You can specify the following options:
OUTMATCH=CAS-libref.data-table
specifies a data table to contain the rule-matching results (that is, whether a document satisfies a rule).
CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier,
and data-table specifies the name of the output data table. For more information about this two-level
name, see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on
page 7.
For more information about this data table, see the section “Scoring Data Set” on page 25.
RULETERMS=CAS-libref.data-table
specifies a data table that contains the terms in each rule that the BOOLRULE procedure generates.
CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier,
and data-table specifies the name of the input data table. For more information about this two-level
name, see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on
page 7.
For more information about this data table, see the section “RULETERMS= Data Table” on page 25.
TERMINFO Statement
TERMINFO < options > ;
The TERMINFO statement specifies information about the data table that is specified in the TERMINFO=
option in the PROC BOOLRULE statement. If you specify the TERMINFO= data table in the PROC
BOOLRULE statement, you must also include this statement to specify which variables in the data table
contain the term ID and the term label, respectively.
You can specify the following options:
ID=variable
specifies the variable that contains the term ID. To fetch the text of terms, the values in variable are
matched to the term ID variable that is specified in the TERMID= option in the PROC BOOLRULE
statement. The variable can be either a numeric variable or a character variable. Its type must match
the type of the variable that is specified in the TERMID= option in the PROC BOOLRULE statement.
LABEL=variable
specifies the variable that contains the text of the terms, where variable must be a character variable.
1. Use an information gain criterion to form an ordered term candidate list. The term that best predicts the
category is first on the list, and so on. Terms that do not have a significant relationship to the category
are removed from this list. Set the current term to the first term.
2. Determine the “estimated precision” of the current term. The estimated precision is the projected
percentage of the term’s occurrence with the category in out-of-sample data, using additive smoothing.
Create a rule that consists of that term.
3. If the “estimated precision” of the current rule could not possibly be improved by adding more terms
as qualifiers, then go to step 6.
4. Starting with the next term on the list, determine whether the conjunction of the current rule with that
term (via either term presence or term absence) significantly improves the information gain and also
improves estimated precision.
5. If there is at least one combination that meets the criterion in step 4, choose the combination that yields
the best estimated precision, and go to step 3 with that combination. Otherwise, continue to step 6.
6. If the best rule obtained in step 3 has a higher estimated precision than the current “highest precision”
rule, replace the current rule with the new rule.
7. Increment the current term to the next term in the ordered candidate term list and go to step 2. Continue
repeating until all terms in the list have been considered.
8. Determine whether the harmonic mean of precision and recall (the F1 score) of the current rule set is
improved by adding the best rule obtained by steps 1 to 7. If it is not, then exit.
9. If so, remove from the document set all documents that match the new rule, add this rule to the rule set,
and go to step 1 to start creating the next rule in the rule set.
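The steps above amount to a greedy sequential-covering loop, which can be sketched in Python. This is an illustrative skeleton, not the BOOLRULE implementation; make_best_rule (standing in for steps 1 through 7) and f1_of (the rule set's F1 score) are hypothetical callbacks:

```python
def build_rule_set(docs, make_best_rule, f1_of):
    """Add the best available rule while it improves the rule set's F1
    score (step 8); after each addition, remove the documents that the
    new rule matches (step 9)."""
    rule_set, remaining = [], list(docs)
    while remaining:
        rule = make_best_rule(remaining)  # steps 1-7 on the remaining docs
        if rule is None or f1_of(rule_set + [rule]) <= f1_of(rule_set):
            break  # step 8: adding this rule does not improve F1, so exit
        rule_set.append(rule)
        remaining = [d for d in remaining if not rule(d)]
    return rule_set
```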
BOOLLEAR contains two essential processes for rule extraction: a term ensemble process (steps 4–5), which
creates rules by adding terms; and a rule ensemble process (steps 2–9), which creates a rule set. The rule set
can then be used for either content exploration or text categorization. Both the term ensemble process and the
rule ensemble process are iterative processes. The term ensemble process forms an inner loop of the rule
ensemble process. Efficient heuristic search strategies and sophisticated evaluation criteria are designed to
ensure state-of-the-art performance of BOOLLEAR.
Before adding terms to a rule, BOOLLEAR first sorts the candidate terms in descending order according
to their g-score with respect to the target category. It then starts to add terms to the rule iteratively. In each
iteration of the term ensemble process, BOOLLEAR takes a term t from the ordered candidate term list
and determines whether adding the term to the current rule r can improve the rule's estimated precision. To
ensure that the term is good enough, BOOLLEAR tries kin - 1 additional terms in the term list, where kin
is the maximum number of terms to examine for improvement. If none of these terms is better than term t
(that is, if each of them results in a lower g-score for the current rule r), the term is considered to be k-best,
where k = kin, and BOOLLEAR updates the current rule r by adding term t to it. If one of the kin - 1
additional terms is better than term t, BOOLLEAR sets that term as t and tries kin - 1 additional terms to
determine whether this new t is better than all of those additional terms. BOOLLEAR repeats until the
current term t is k-best or until it reaches the end of the term list. After a term is added to the rule,
BOOLLEAR marks the term as used and continues to identify the next k-best term from the unused terms
in the sorted candidate term list. When a k-best term is identified, BOOLLEAR adds it to the rule.
BOOLLEAR keeps adding k-best terms until the rule cannot be further improved. By trying to identify a
k-best term instead of the global best, BOOLLEAR shrinks its search space to improve its efficiency.
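One way to read the k-best selection is sketched below. The code is illustrative, not the BOOLLEAR implementation; score stands in for the g-score improvement of adding a candidate to the current rule, and the tie-breaking choices are assumptions:

```python
def k_best(candidates, score, k):
    """Return the first candidate that none of the next k-1 candidates
    beats, instead of scanning the whole list for the global best."""
    if not candidates:
        return None
    best = 0  # index of the current champion
    i = 1
    while i < len(candidates):
        window = range(i, min(i + k - 1, len(candidates)))
        better = [j for j in window if score(candidates[j]) > score(candidates[best])]
        if not better:
            return candidates[best]  # champion survived k-1 challengers: k-best
        best = max(better, key=lambda j: score(candidates[j]))
        i = best + 1  # challenge the new champion with the candidates after it
    return candidates[best]

# With terms pre-sorted by g-score, the first term usually survives:
terms = ["rate", "bank", "percent", "sell"]
gains = {"rate": 9.0, "bank": 7.5, "percent": 7.0, "sell": 2.0}
print(k_best(terms, gains.get, k=3))  # rate
```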
In each iteration of the rule ensemble process, BOOLLEAR tries to find a rule r that has the highest precision
in classifying the previously unclassified positive samples. For the first iteration, all samples are unclassified.
To ensure that the precision of rule r is good enough, BOOLLEAR generates kout - 1 additional rules, where
kout is an input parameter that you specify in the MAXTRIESOUT= option in the PROC BOOLRULE
statement. If one of these rules has a higher precision than rule r, BOOLLEAR sets that rule as the new rule r
and generates another kout - 1 rules to determine whether this new rule is the best among them. BOOLLEAR
repeats this process until the current rule r is better than any of the kout - 1 rules that are generated after it.
The obtained rule r is called a k-best rule, where k = kout. When BOOLLEAR obtains a k-best rule, it adds
that rule to the rule set and removes from the corpus all documents that satisfy the rule. In order to reduce
the possibility of generating redundant rules, BOOLLEAR then determines whether the F1 score of the rule
set is improved. If the F1 score is improved, BOOLLEAR goes to the next iteration and uses the updated
corpus to generate another rule. Otherwise, it treats the current rule set as unimprovable, stops the search,
and outputs the currently obtained rule set. Note that to identify a “good” rule, BOOLLEAR does not go
through all the potential rules to find the global “best,” because doing so can be computationally intractable
when the number of candidate terms is large. Also, before BOOLLEAR generates a rule, it orders the terms
in the candidate term set by their correlation to the target. So it is reasonable to expect that the obtained
k-best rule is close to a globally best rule in terms of its capability for improving the F1 score of the rule set.
For information about the F1 score, see the section “Precision, Recall, and the F1 Score” on page 20.
precision = TP / (TP + FP)

recall = TP / (TP + FN)

F1 = 2 × precision × recall / (precision + recall)
where TP is the true-positive (the number of documents that are predicted to be positive and are actually
positive), FP is the false-positive (the number of documents that are predicted to be positive but are actually
negative), TN is the true-negative (the number of documents that are predicted to be negative and are actually
negative), and FN is the false-negative (the number of documents that are predicted to be negative but are
actually positive). A classifier thus obtains a high F1 score if and only if it can achieve both high precision
and high recall. The F1 score is a better measurement than accuracy when the data are imbalanced,2 because
a classifier can obtain very high accuracy by predicting that all samples belong to the majority category.
2 Accuracy is defined as (TP + TN) / (TP + FP + TN + FN).
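As a quick check on these formulas, the following Python sketch (the helper name is illustrative, not part of the procedure) computes precision, recall, and the F1 score from a confusion table:

```python
def f1_score(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Imbalanced collection: 10 positive and 90 negative documents.
# A classifier that finds 8 positives at the cost of 4 false alarms:
precision, recall, f1 = f1_score(tp=8, fp=4, fn=2)
```

On this imbalanced collection, a degenerate classifier that predicts every document as negative reaches 90% accuracy yet recalls nothing, which is why the F1 score is the better measurement here.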
Measurements Used in BOOLLEAR
g-Score
BOOLLEAR uses the g-test (which is also known as the likelihood-ratio or maximum likelihood statistical
significance test) as an information gain criterion to evaluate the correlation between terms and the target.
The g-test generates a g-score, which has two beneficial properties: as a form of mutual information, it is
approximately equivalent to information gain in the binary case; and because it is distributed as a chi-square,
it can also be used for statistical significance testing. The g-test is designed to compare the independence of
two categorical variables. Its null hypothesis is that the proportions at one variable are the same for different
values of the second variable. Given the TP, FP, FN, and TN of a term, the term’s g-score can be computed as
X O.i /
g D2 O .i / log
i DfTP;TN;FP;FNg E.i /
O.TP/ D TP
O.FP/ D FP
O.TN/ D TN
O.FN/ D FN
.TP C FP/ P
E.TP/ D
PCN
.TP C FP/ N
E.FP/ D
PCN
.TN C FN/ N
E.TN/ D
PCN
.TN C FN/ P
E.FN/ D
PCN
where P is the number of positive documents; N is the number of negative documents; O(TP), O(FP), O(TN),
and O(FN) refer to the observed TP, FP, TN, and FN of a term; and E(TP), E(FP), E(TN), and E(FN) refer to
the expected TP, FP, TN, and FN of a term. A term has a high g-score if it appears often in positive documents
but rarely in negative documents, or vice versa.
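The g-score computation can be sketched in Python as follows (a hypothetical helper, not the procedure's implementation; empty cells contribute 0 to the sum):

```python
import math

def g_score(tp, fp, tn, fn):
    """Likelihood-ratio (g-test) statistic for a term's 2x2 contingency table."""
    p = tp + fn            # number of positive documents
    n = fp + tn            # number of negative documents
    observed = {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
    expected = {
        "TP": (tp + fp) * p / (p + n),
        "FP": (tp + fp) * n / (p + n),
        "TN": (tn + fn) * n / (p + n),
        "FN": (tn + fn) * p / (p + n),
    }
    total = 0.0
    for cell, o in observed.items():
        if o > 0:          # the limit of x*log(x/E) as x -> 0 is 0
            total += o * math.log(o / expected[cell])
    return 2 * total
```

A term distributed independently of the target (for example, tp=5, fp=45, tn=45, fn=5) scores 0, whereas a term that appears in every positive document and no negative document scores the maximum possible for the collection.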
Estimated Precision
Estimated precision helps BOOLLEAR shorten its search path and avoid generating overly specific rules.
The precision is estimated by a form of additive smoothing with an additional correction (err_i) to favor shorter
rules over longer rules:

   precision_i^m(t) = ( TP_{i,t} + m * P/(N+P) ) / ( TP_{i,t} + FP_{i,t} + m ) - err_{i-1}

   err_i = TP_{i,t} / ( TP_{i,t} + FP_{i,t} ) - ( TP_{i,t} + m * P/(N+P) ) / ( TP_{i,t} + FP_{i,t} + m ) + err_{i-1}

In the preceding equations, m (m >= 1) is a parameter that you specify for bias correction. A large m is called for
when a very large number of rules are evaluated, in order to minimize selection bias. TP_{i,t} and FP_{i,t} are the
true-positive and false-positive of rule t when the length of the rule is i.
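These two equations can be sketched together in Python (function and parameter names are mine, chosen for illustration):

```python
def estimated_precision(tp, fp, m, p, n, err_prev=0.0):
    """Additive-smoothed precision of a rule of length i, minus the
    carried-over correction err_{i-1}; also returns err_i for the next,
    longer rule."""
    prior = p / (n + p)                        # base rate of positive documents
    smoothed = (tp + m * prior) / (tp + fp + m)
    precision_est = smoothed - err_prev        # first equation
    err = tp / (tp + fp) - smoothed + err_prev # second equation
    return precision_est, err
```

Note the invariant that precision_est + err always equals the raw precision TP/(TP+FP): the correction only shifts credit toward shorter rules, it never changes the total.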
Improvability Test
BOOLLEAR tests for improvability in the term ensemble step for “in-process” model pruning. To determine
whether a rule is improvable, BOOLLEAR applies the g-test to a perfect confusion table, which is defined as

   TP    0
    0   FP

In this table, TP is the true-positive of the rule and FP is the false-positive of the rule. The g-score that is
computed by using this table reflects the maximum g-score that a rule could possibly obtain if a perfectly
discriminating term were added to the rule. If this g-score does not reach the significance threshold that
corresponds to the maximum p-values that you specify in the GPOS= and GNEG= options, BOOLLEAR
considers the rule to be unimprovable.
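Plugging the perfect confusion table into the g-score formula gives this ceiling in closed form; a small Python sketch (the helper name is an assumption, not part of the procedure):

```python
import math

def max_possible_g(tp, fp):
    """g-score of the perfect confusion table [[TP, 0], [0, FP]]: the best
    score any extension of a rule with these counts could achieve."""
    total = tp + fp
    g = 0.0
    if tp > 0:
        g += tp * math.log(total / tp)   # O(TP)=TP, E(TP)=TP*TP/total
    if fp > 0:
        g += fp * math.log(total / fp)   # O(TN)=FP, E(TN)=FP*FP/total
    return 2 * g
```

If even this ceiling falls below the significance threshold, no added term can rescue the rule, so BOOLLEAR abandons the expansion.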
Feature Selection
BOOLLEAR uses the g-test to evaluate terms. Assume that MAXCANDIDATES=p and MINSUPPORTS=c
in the PROC BOOLRULE statement. A term is added to the ordered candidate term list if and only if the
following two conditions hold:

   The term appears in at least c positive documents.

   The term's g-score ranks among the top p g-scores of all terms.
The size of the candidate term list controls the size of the search space. The smaller the size, the fewer terms
are used for rule extraction, and therefore the smaller the search space is.
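A sketch of this filtering step (the exact conditions are inferred from the MAXCANDIDATES= and MINSUPPORTS= descriptions, and all names here are illustrative):

```python
def select_candidates(term_stats, p, c):
    """Keep at most p terms, ordered by descending g-score, requiring each
    term to appear in at least c positive documents.

    term_stats maps a term to a (g_score, positive_doc_count) pair."""
    eligible = [(term, g) for term, (g, pos_count) in term_stats.items()
                if pos_count >= c]
    eligible.sort(key=lambda pair: -pair[1])   # most predictive terms first
    return [term for term, _ in eligible[:p]]
```

Shrinking p directly shrinks the search space, because only the returned terms are considered when rules are built.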
Significance Testing
In many rule extraction algorithms, rules are built until they perform perfectly on a training set, and pruning
is applied afterwards. In contrast, BOOLLEAR prunes “in-process.” The following three checks are a form
of in-process pruning; rules are not expanded when their expansion does not meet these basic requirements.
These requirements help BOOLLEAR truncate its search path and avoid generating overly specific rules.
Minimum positive document coverage: BOOLLEAR requires that a rule be satisfied by at least s
positive documents, where s is the value of the MINSUPPORTS= option in the PROC BOOLRULE
statement.

Early stop based on g-test: BOOLLEAR stops searching when the g-score that is calculated for
improving (or starting) a rule does not meet the required statistical significance levels.

Early stop based on estimated precision: BOOLLEAR stops building a rule when the estimated
precision of the rule does not improve when the current best term is added to the rule. This strategy
helps BOOLLEAR shorten its search path.
k-Best Search
In the worst case, BOOLLEAR could still examine an exponential number of rules, although the heuristics
described here minimize that chance. But because the terms are ordered by predictiveness of the category
beforehand, a k-best search is used to further improve the efficiency of BOOLLEAR: If BOOLLEAR tries
unsuccessfully to expand (or start) a rule numerous times with the a priori “best” candidates, then the search
can be prematurely ended. Two optional parameters, k_in and k_out, determine the maximum number of terms
and rules to examine for improvement. The k_in parameter (which is specified in the MAXTRIESIN= option)
is used in the term ensemble process: if k_in consecutive terms have been checked for building possible rules
and none of them are superior to the best current rule, the search is terminated. The k_out parameter (which is
specified in the MAXTRIESOUT= option) is used in the rule ensemble process: if k_out consecutive terms
have been checked to add to a rule and they do not generate a better rule, then the search for expanding
that rule is terminated. This helps BOOLLEAR shorten its search path, even with a very large number of
candidate terms, with very little sacrifice in accuracy.
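The term-ensemble side of this early stopping can be sketched as follows (the improves callback stands in for BOOLLEAR's g-score and estimated-precision comparison; the names are mine):

```python
def expand_rule(rule, ordered_terms, k_in, improves):
    """Greedily extend a rule with terms scanned in predictiveness order,
    giving up after k_in consecutive terms that fail to improve it."""
    best = list(rule)
    misses = 0
    for term in ordered_terms:
        candidate = best + [term]
        if improves(candidate, best):
            best, misses = candidate, 0
        else:
            misses += 1
            if misses >= k_in:
                break          # k_in consecutive failures: stop expanding
    return best
```

Because the terms were pre-ordered by predictiveness, a long run of failures is strong evidence that no remaining term will help, which is what makes the premature cutoff cheap in accuracy.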
Improvability Test
BOOLLEAR tests whether adding a theoretical perfectly discriminating term to a particular rule could
possibly have both a statistically significant result and a higher estimated precision than the current rule. If it
cannot, then the current rule is recognized without additional testing as the best possible rule, and no further
expansion is needed.
Field Description
Target The category that the term is selected for (this field corresponds to the
Target field in the RULES= data table)
Rank The rank of the term in the ordered term list for the category (term rank
starts from 1)
Term A lowercase version of the term
Key The term identifier of the term
GScore The g-score of the term that is obtained for the target category
Support The number of documents in which the term appears
TP The number of positive documents in which the term appears
FP The number of negative documents in which the term appears
Field Description
Target The target category that the term is selected to model
Target_var The variable that contains the target
Target_val The value of the target variable
Ruleid The ID of a rule (Ruleid starts from 1)
Ruleid_loc The ID of a rule in a rule set (in each rule set, Ruleid_loc starts from 1)
Rule The text content of the rule
TP The number of positive documents that are satisfied by the rule set when
the rule is added to the rule set
FP The number of negative documents that are satisfied by the rule set when
the rule is added to the rule set
Support The number of documents that are satisfied by the rule set when the rule
is added to the rule set
rTP The number of positive documents that are satisfied by the rule when the
rule is added to the rule set
rFP The number of negative documents that are satisfied by the rule when
the rule is added to the rule set
rSupport The number of documents that are satisfied by the rule when the rule is
added to the rule set
F1 The F1 score of the rule set when the rule is added to the rule set
Precision The precision of the rule set when the rule is added to the rule set
Recall The recall of the rule set when the rule is added to the rule set
This data table contains the discovered rule sets for predicting the target levels of the target variable. In each
rule set, the order of the rules is important and helps you interpret the results. The first rule is trained using
all the data; the second rule is trained on the data that did not satisfy the first rule; and subsequent rules are
built only after the removal of observations that satisfy previous rules. The fit statistics (TP, FP, Support, F1,
Precision, and Recall) of each rule are cumulative and represent totals that include using that particular rule
along with all the previous rules in the rule set.
When you specify TARGETTYPE=MULTICLASS in the DOCINFO statement, each target level of the
target variable defines a category and the target field contains the same content as the Target_val field. When
TARGETTYPE=BINARY in the DOCINFO statement, each target variable defines a category and the target
field contains the same content as the Target_var field.
Field Description
Target The target category that the term is selected to model
Target_var The variable that contains the target
Target_val The value of the target variable
Ruleid The ID of a rule (Ruleid starts from 1)
Ruleid_loc The ID of a rule in a rule set (in each rule set, Ruleid_loc starts from 1)
Rule The text content of the rule
_termnum_ The ID of a term that is used in the rule
Direction Specifies whether the term is positive or negative (if Direction=1, the
term is positive; if Direction=–1, the term is negative)
Weight The weight of a term
Term weights are used for scoring documents. The weight of a negative term is always −1. If a positive term
is in rule r and there are k positive terms in the rule, the weight of this positive term is 1/k + 0.000001. If a
document contains all the positive terms in the rule but none of the negative terms, the score of the document
is k * (1/k + 0.000001) > 1, indicating that the document satisfies the rule. Otherwise, the document's
score is less than 1, indicating that the document does not satisfy the rule.
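The scoring scheme above can be sketched in Python (a hypothetical helper; the procedure performs this internally):

```python
def score_document(doc_terms, pos_terms, neg_terms):
    """Score one document against one rule: each of the k positive terms
    weighs 1/k + 0.000001, and each negative term weighs -1."""
    k = len(pos_terms)
    score = sum(1.0 / k + 0.000001 for t in pos_terms if t in doc_terms)
    score -= sum(1 for t in neg_terms if t in doc_terms)
    return score

# All positive terms present, no negative term: score just exceeds 1.
satisfied = score_document({"love", "phone"}, ["love", "phone"], ["boring"])
```

The 0.000001 nudge is what pushes a fully matched rule's score strictly above 1, so "score > 1" is an exact test for rule satisfaction: missing even one positive term, or matching one negative term, pulls the score below 1.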
A document satisfies a rule (that is, the rule is matched in the document) if and only if all the positive terms
in the rule are present in the document and none of the negative terms are present in the document. PROC
BOOLRULE also outputs a special rule for which ID=0. If a document satisfies the rule for which ID=0, then
the document does not satisfy any rule in the RULETERMS= table. For this special rule, the target has a
missing value.
Table 2.5 shows the fields in this data table.
Field Description
_Document_ ID of the document that satisfies the rule
_Target_ ID of the target that the rule is generated for
_Rule_ID_ ID of the rule that the document satisfies
data mycas.reviews;
infile datalines delimiter='|' missover;
length text $300 category $20;
input text$ positive category$ did;
datalines;
This is the greatest phone ever! love it!|1|electronics|1
The phone's battery life is too short and screen resolution is low.|0|electronics|2
The screen resolution is low, but I love this tv.|1|electronics|3
The movie itself is great and I like it, although the resolution is low.|1|movies|4
The movie's story is boring and the acting is poor.|0|movies|5
I watched this movie on tv, it's not good on a small screen. |0|movies|6
watched the movie first and loved it, the book is even better!|1|books |7
I like the story in this book, they should put it on screen.|1|books|8
I love the author, but this book is a waste of time, don't buy it.|0|books|9
;
run;
The following TEXTMINE procedure call parses the mycas.reviews data table, stores the term-by-document
matrix in the mycas.reviews_bow data table in transactional format, and stores terms that appeared in the
mycas.reviews data table in the mycas.reviews_terms data table:
Example 2.1: Rule Extraction for Binary Targets
The following statements run PROC BOOLRULE to extract rules from the mycas.reviews_bow data table and
run PROC PRINT to show the results. By default, TARGETTYPE=BINARY. One target variable, positive, is
specified; this variable indicates whether the reviews are positive or negative.
proc boolrule
data = mycas.reviews_bow
docid = _document_
termid = _termnum_
docinfo = mycas.reviews
terminfo = mycas.reviews_terms
minsupports = 1
mpos = 1
gpos = 1;
docinfo
id = did
targets = (positive);
terminfo
id = key
label = term;
output
ruleterms = mycas.ruleterms
rules = mycas.rules;
run;
data rules;
set mycas.rules;
run;

proc print data=rules;
var target ruleid rule F1 precision recall;
run;
Output 2.1.1 shows that the mycas.rules data table contains rules that are generated for the “positive”
categories.
data mycas.reviews;
infile datalines delimiter='|' missover;
length text $300 category $20;
input text$ positive category$ did;
datalines;
This is the greatest phone ever! love it!|1|electronics|1
The phone's battery life is too short and screen resolution is low.|0|electronics|2
The screen resolution is low, but I love this tv.|1|electronics|3
The movie itself is great and I like it, although the resolution is low.|1|movies|4
The movie's story is boring and the acting is poor.|0|movies|5
I watched this movie on tv, it's not good on a small screen. |0|movies|6
watched the movie first and loved it, the book is even better!|1|books |7
I like the story in this book, they should put it on screen.|1|books|8
I love the author, but this book is a waste of time, don't buy it.|0|books|9
;
run;
The following TEXTMINE procedure call parses the mycas.reviews data table, stores the term-by-document
matrix in the mycas.reviews_bow data table in transactional format, and stores terms that appeared in the
mycas.reviews data table in the mycas.reviews_terms data table:
The following statements run PROC BOOLRULE to extract rules from the mycas.reviews_bow data table
and run PROC PRINT to show the results. TARGETTYPE=MULTICLASS is specified, and category is
specified as the target variable, which contains three levels: “electronics,” “movies,” and “books.” Each level
defines a category for which the BOOLRULE procedure extracts rules.
proc boolrule
data = mycas.reviews_bow
docid = _document_
termid = _termnum_
docinfo = mycas.reviews
terminfo = mycas.reviews_terms
minsupports = 1
mpos = 1
gpos = 1;
docinfo
id = did
targettype = multiclass
targets = (category);
terminfo
id = key
label = term;
output
ruleterms = mycas.ruleterms
rules = mycas.rules;
run;
data rules;
set mycas.rules;
run;

proc print data=rules;
var target ruleid rule F1 precision recall;
run;
Output 2.2.1 shows that the mycas.rules data table contains rules that are generated for the “electronics,”
“movies,” and “books” categories.
data mycas.reviews;
infile datalines delimiter='|' missover;
length text $300 category $20;
input text$ positive category$ did;
datalines;
This is the greatest phone ever! love it!|1|electronics|1
The phone's battery life is too short and screen resolution is low.|0|electronics|2
The screen resolution is low, but I love this tv.|1|electronics|3
The movie itself is great and I like it, although the resolution is low.|1|movies|4
The movie's story is boring and the acting is poor.|0|movies|5
I watched this movie on tv, it's not good on a small screen. |0|movies|6
watched the movie first and loved it, the book is even better!|1|books |7
I like the story in this book, they should put it on screen.|1|books|8
I love the author, but this book is a waste of time, don't buy it.|0|books|9
;
run;
The following TEXTMINE procedure call parses the mycas.reviews data table, stores the term-by-document
matrix in the mycas.reviews_bow data table in transactional format, and stores terms that appeared in the
mycas.reviews data table in the mycas.reviews_terms data table:
The following statements run PROC BOOLRULE to extract rules from the mycas.reviews_bow data table
and run PROC PRINT to show the results. TARGETTYPE=BINARY is specified, and category is specified
as the target variable, which contains three levels: “electronics,” “movies,” and “books.” Because the “movies”
and “books” levels are specified in the EVENTS= option, the BOOLRULE procedure extracts rules for
“movies” and “books,” but not “electronics.”
proc boolrule
data = mycas.reviews_bow
docid = _document_
termid = _termnum_
docinfo = mycas.reviews
terminfo = mycas.reviews_terms
minsupports = 1
mpos = 1
gpos = 1;
docinfo
id = did
targettype = binary
targets = (category)
events = ("movies" "books");
terminfo
id = key
label = term;
output
ruleterms = mycas.ruleterms
rules = mycas.rules;
run;
data rules;
set mycas.rules;
run;

proc print data=rules;
var target ruleid rule F1 precision recall;
run;
Output 2.3.1 shows that the mycas.rules data table contains rules that are generated for the “movies” and
“books” categories.
The positive variable indicates whether a review is positive or negative. The category variable contains the
category of the reviews. The did variable contains the ID of the documents. Each row in the data table
represents a document for analysis.
data mycas.reviews;
infile datalines delimiter='|' missover;
length text $300 category $20;
input text$ positive category$ did;
datalines;
This is the greatest phone ever! love it!|1|electronics|1
The phone's battery life is too short and screen resolution is low.|0|electronics|2
The screen resolution is low, but I love this tv.|1|electronics|3
The movie itself is great and I like it, although the resolution is low.|1|movies|4
The movie's story is boring and the acting is poor.|0|movies|5
I watched this movie on tv, it's not good on a small screen. |0|movies|6
watched the movie first and loved it, the book is even better!|1|books |7
I like the story in this book, they should put it on screen.|1|books|8
I love the author, but this book is a waste of time, don't buy it.|0|books|9
;
run;
The following DATA step generates the testing data, which contain two observations that have two variables.
The text variable contains the input reviews. The did variable contains the ID of the documents. Each row in
the data table represents a document for analysis.
data mycas.reviews_test;
infile datalines delimiter='|' missover;
length text $300;
input text$ did;
datalines;
love it! a great phone, even better than advertised|1
I like the book, GREATEST in this genre|2
;
run;
The following TEXTMINE procedure call parses the mycas.reviews data table, stores the term-by-document
matrix in the mycas.reviews_bow data table in transactional format, and stores terms that appeared in the
mycas.reviews data table in the mycas.reviews_terms data table:
The following statements run PROC BOOLRULE to extract rules from the mycas.reviews_bow data table.
TARGETTYPE=BINARY is specified. One target variable, positive, is specified; this variable indicates
whether the reviews are positive or negative.
proc boolrule
data = mycas.reviews_bow
docid = _document_
termid = _termnum_
docinfo = mycas.reviews
terminfo = mycas.reviews_terms
minsupports = 1
mpos = 1
gpos = 1;
docinfo
id = did
targettype = binary
targets = (positive);
terminfo
id = key
label = term;
output
ruleterms = mycas.ruleterms
rules = mycas.rules;
run;
The TMSCORE procedure uses the parsing configuration that is stored in the mycas.parseconfig data
table to parse the mycas.reviews_test data table. The term-by-document matrix is stored in the
mycas.reviews_test_bow data table.
proc tmscore
data = mycas.reviews_test
terms = mycas.reviews_terms
config = mycas.parseconfig
outparent = mycas.reviews_test_bow;
doc_id did;
var text;
run;
The following statements run PROC BOOLRULE to match rules in the testing data and run PROC PRINT to
show the matching results:
proc boolrule
data = mycas.reviews_test_bow
docid = _document_
termid = _termnum_;
score
ruleterms = mycas.ruleterms
outmatch = mycas.match;
run;
The mycas.match data table in Output 2.4.1 shows which documents satisfy which rules.
References
Cox, J., and Zhao, Z. (2014). “System for Efficiently Generating k-Maximally Predictive Association Rules
with a Given Consequent.” US Patent Number 20140337271.
Chapter 3
The TEXTMINE Procedure
Contents
Overview: TEXTMINE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
PROC TEXTMINE Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Using CAS Sessions and CAS Engine Librefs . . . . . . . . . . . . . . . . . . . . . 37
Getting Started: TEXTMINE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Syntax: TEXTMINE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
PROC TEXTMINE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
DOC_ID Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
PARSE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
SAVESTATE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
SELECT Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
SVD Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
TARGET Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
VARIABLES Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Details: TEXTMINE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Part-of-Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Noun Group Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Entity Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Multiword Terms Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Language Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Term and Cell Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Sparse Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Coordinate List (COO) Format . . . . . . . . . . . . . . . . . . . . . . . . . 58
Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Applications in Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
SVD-Only Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Topic Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Output Data Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
The OUTCHILD= Data Table . . . . . . . . . . . . . . . . . . . . . . . . . 60
The OUTCONFIG= Data Table . . . . . . . . . . . . . . . . . . . . . . . . 60
The OUTDOCPRO= Data Table . . . . . . . . . . . . . . . . . . . . . . . . 61
The OUTPARENT= Data Table . . . . . . . . . . . . . . . . . . . . . . . . 61
The OUTPOS= Data Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
The OUTTERMS= Data Table . . . . . . . . . . . . . . . . . . . . . . . . . 62
Functionalities that are related to document parsing, term-by-document matrix creation, and dimension
reduction are integrated into one procedure in order to process data more efficiently.
Parsing supports essential natural language processing (NLP) features, which include tokenizing,
stemming, part-of-speech tagging, noun group extraction, default or customized stop lists and start
lists, entity parsing, multiword tokens, and synonym lists.
Term weighting and filtering are supported for term-by-document matrix creation.
cas mysess;
libname mycas cas sessref=mysess;
The CAS statement creates the CAS session named mysess, and the LIBNAME statement creates the
mycas CAS engine libref that you use to connect to this session. It is not necessary to explicitly name the
CASHOST and CASPORT of the CAS server in the CAS statement, because these values are retrieved from
the corresponding SAS option values.
If you have created the mysess session, you can terminate it by using the TERMINATE option in the CAS
statement as follows:
data mycas.getstart;
infile datalines delimiter='|' missover;
length text $150;
input text$ did;
datalines;
proc cas;
loadtable caslib="ReferenceData" path="en_stoplist.sashdat";
run;
quit;
The following statements parse the input collection and use singular value decomposition followed by a
rotation to discover topics that exist in the sample collection. The statements specify that all terms in the
document collection, except for those on the stop list, are to be kept for generating the term-by-document
matrix. The summary information about the terms in the document collection is stored in a data table
named mycas.terms. The SVD statement requests that the first three singular values and singular vectors be
computed. The topic assignments of the documents are stored in a data table named mycas.docpro, and the
descriptive terms that define each topic are stored in a data table named mycas.topics.
The mycas.docpro data table contains four variables: the first variable is the document ID, and the remaining
three variables are obtained by projecting the original document onto the three left-singular vectors that have
been rotated with the default orthogonal (varimax) rotation. The mycas.topics data table has 3 variables
containing summary information of the discovered topics. Finally, the mycas.astoretab table contains a
binary representation of a scoring model.
The following statements use PROC PRINT in Base SAS to show the contents of the first 10 rows of the
sorted mycas.docpro data table that is generated by the TEXTMINE procedure:
data docpro;
set mycas.docpro;
run;
proc sort data=docpro;
by did;
run;
proc print data = docpro (obs=10);
run;
Figure 3.2 shows the output of PROC PRINT. For information about the output of the OUTDOCPRO= option,
see the section “The OUTDOCPRO= Data Table” on page 61.
The following statements use a DATA step and PROC PRINT to show the contents of the mycas.topics data
table that is generated by the TEXTMINE procedure:
The following statements use a DATA step and the SORT and PRINT procedures to show the first 10
observations of the mycas.terms data table that is generated by the TEXTMINE procedure:
The following DATA step and statements create data and then score that data with PROC ASTORE.
data mycas.scoreData;
infile datalines delimiter='|' missover;
length text $150;
input text$ id;
datalines;
Deployment in the cloud or on-site. | 1
SAS for business analytics. | 2
Maintenance and hidden costs. | 3
;
run;
proc astore;
score rstore=mycas.aStoreTab
data=mycas.scoreData
out= mycas.scoreResults
copyVars= id;
run;
The PROC TEXTMINE statement invokes the procedure. Table 3.1 summarizes the options in the statement
by function. The options are then described fully in alphabetical order.
Option Description
Basic Options
DATA | DOC= Specifies the input document data table
LANGUAGE= Specifies the language that the input data table of documents
uses
NEWVARNAMES Specifies that the new-style variable names should be used
on tables
Multithreading Options
NTHREADS= Specifies number of threads
DATA=CAS-libref.data-table
names the input data table for PROC TEXTMINE to use. The default is the most recently created data
table. CAS-libref.data-table is a two-level name, where
CAS-libref refers to a collection of information that is defined in the LIBNAME statement and
includes the caslib, which includes a path to the data, and a session identifier, which
defaults to the active session but which can be explicitly defined in the LIBNAME
statement. For more information about CAS-libref , see the section “Using CAS
Sessions and CAS Engine Librefs” on page 37.
data-table specifies the name of the input data table.
Each row of the input data table must contain one text variable and one ID variable that correspond to
the text and the unique ID of a document, respectively.
When you specify the SVD statement but not the PARSE statement, PROC TEXTMINE runs in
SVD-only mode. In this mode, the DATA= option names the input SAS data table that contains the
term-by-document matrix that is generated by the OUTPARENT= option in the PARSE statement.
LANGUAGE=language
names the language that is used by the documents in the input SAS data table. Languages sup-
ported in the current release are Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, Finnish,
French, German, Greek, Hebrew, Indonesian, Italian, Japanese, Korean, Norwegian, Polish, Por-
tuguese, Russian, Slovak, Slovene, Spanish, Swedish, Thai, Turkish, and Vietnamese. By default,
LANGUAGE=ENGLISH.
NEWVARNAMES
adds leading and trailing underscores to variable names in the input and output tables.
NTHREADS=nthreads
specifies the number of threads to be used. By default, the number of threads is the same as the number
of CPUs on the CAS server.
DOC_ID Statement
DOC_ID variable ;
The DOC_ID statement specifies the variable that contains the ID of each document. In the input data table,
each row corresponds to one document. The ID of each document must be unique; it can be either a number
or a string of characters.
PARSE Statement
PARSE < parse-options > ;
The PARSE statement specifies the options for parsing the input documents and creating the term-by-
document matrix. Table 3.2 summarizes the parse-options in the statement by function. The parse-options
are then described fully in alphabetical order.
parse-option Description
Parsing Options
ENTITIES= Specifies whether to extract entities in parsing
MULTITERM= Specifies the multiword term list
NONOUNGROUPS | NONG Suppresses noun group extraction in parsing
NOSTEMMING Suppresses stemming in parsing
NOTAGGING Suppresses part-of-speech tagging in parsing
SHOWDROPPEDTERMS= Includes dropped terms in the OUTTERMS= data table
START= Specifies the start list
STOP= Specifies the stop list
SYNONYM | SYN= Specifies the synonym list
Term-by-Document Matrix Creation Options
CELLWGT= Specifies how cells are weighted
REDUCEF= Specifies the frequency for term filtering
TERMWGT= Specifies how terms are weighted
Output Options
OUTCHILD= Specifies the data table to contain the raw term-by-document
matrix. All kept terms, whether or not they are child terms,
are represented in this data table along with their correspond-
ing frequency.
OUTCONFIG= Specifies the data table to contain the option settings that
PROC TEXTMINE uses in the current run
OUTPARENT= Specifies the data table to contain the term-by-document
matrix. Child terms are not represented in this data table.
The frequencies of child terms are attributed to their corre-
sponding parents.
OUTTERMS= Specifies the data table to contain the summary information
about the terms in the document collection
OUTPOS= Specifies the data table to contain the position information
about the child terms’ occurrences in the document collection
CELLWGT=LOG | NONE
specifies how the elements in the term-by-document matrix are weighted. You can specify the following
values:
LOG weights cells by using the log formulation. For information about the log formula-
tion for cell weighting, see the section “Term and Cell Weighting” on page 57.
NONE specifies that no cell weight be applied.
ENTITIES=STD | NONE
determines whether to use the standard LITI file for entity extraction. You can specify the following
values:
STD uses the standard LITI file for entity extraction. A term such as “George W. Bush”
is recognized as an entity and given the corresponding entity role and attribute. For
this term, the entity role is PERSON and the attribute is Entity. Although the entity
is treated as a single term, “george w. bush,” the individual tokens “george,” “w,”
and “bush” are also included.
NONE does not use the standard LITI file for entity extraction.
By default, ENTITIES=NONE.
MULTITERM=CAS-libref.data-table
specifies the input SAS data table that contains a list of multiword terms. CAS-libref.data-table is a
two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies
the name of the input data table. For more information about this two-level name, see the DATA=
option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37. The multiword
terms are case-sensitive and are treated as a single entry by the TEXTMINE procedure. Thus, the
terms “Thank You” and “thank you” are processed differently. Consequently, you must convert all
text strings to lowercase or add each of the multiterm’s case variations to the list before using the
TEXTMINE procedure to create consistent multiword terms. The multiterm data table must have a
variable Multiterm and each of its values must be formatted in the following manner:
multiterm: 3: pos
Specifically, the first item is the multiword term itself followed by a colon, the second item is a number
that represents the token type followed by a colon, and the third item is the part of speech that the
multiword term represents. NOTE: The token type 3 is the most common token type for multiterm
lists; it represents compound words.
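As an illustrative sketch, a multiterm data table can be built with a DATA step such as the following. The multiword terms, the table name mycas.Multiterms, and the part-of-speech label Noun are hypothetical examples, not values taken from this documentation; note that both case variations of each term are listed, per the case-sensitivity note above:

```
data mycas.Multiterms;
   infile datalines truncover;
   length Multiterm $50;
   input Multiterm $char50.;
   datalines;
data mining: 3: Noun
Data Mining: 3: Noun
;
run;
```

You would then specify MULTITERM=mycas.Multiterms in the PARSE statement so that each listed phrase is parsed as a single term.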
NONOUNGROUPS
NONG
suppresses standard noun group extraction. By default, the TEXTMINE procedure extracts noun
groups, returns noun phrases without determiners or prepositions, and (unless the NOSTEMMING
option is specified) stems noun group elements.
NOSTEMMING
suppresses stemming of words. By default, words are stemmed; that is, terms such as “advises” and
“advising” are mapped to the parent term “advise.” The TEXTMINE procedure uses dictionary-based
stemming (also known as lemmatization).
NOTAGGING
suppresses tagging of terms. By default, terms are tagged and the TEXTMINE procedure identifies
a term’s part of speech based on context clues. The identified part of speech is provided in the Role
variable of the OUTTERMS= data table.
OUTCHILD=CAS-libref.data-table
specifies the output data table to contain a compressed representation of the sparse term-by-document
matrix. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session
identifier, and data-table specifies the name of the output data table. For more information about this
two-level name, see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs”
on page 37. The term counts are not weighted. The data table saves only the kept, representative terms.
The child frequencies are not attributed to their corresponding parent (as they are in the OUTPARENT=
data table). For more information about the compressed representation of the sparse term-by-document
matrix, see the section “The OUTCHILD= Data Table” on page 60.
OUTCONFIG=CAS-libref.data-table
specifies the output data table to contain configuration information that is used for the current run of
PROC TEXTMINE. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib
and session identifier, and data-table specifies the name of the output data table. For more information
about this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS
Engine Librefs” on page 37. The primary purpose of this data table is to relay the configuration
information from the TEXTMINE procedure to the TMSCORE procedure. The TMSCORE procedure
uses options that are consistent with the TEXTMINE procedure. Thus, the data table that is created by
using the OUTCONFIG= option becomes an input data table for PROC TMSCORE and ensures that
the parsing options are consistent between the two runs. For more information about this data table,
see the section “The OUTCONFIG= Data Table” on page 60.
OUTPARENT=CAS-libref.data-table
specifies the output data table to contain a compressed representation of the sparse term-by-document
matrix. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session
identifier, and data-table specifies the name of the output data table. For more information about
this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS Engine
Librefs” on page 37. The term counts can be weighted, if requested. The data table contains only the
kept, representative terms, and the child frequencies are attributed to the corresponding parent. To
obtain information about the children, use the OUTCHILD= option. For more information about the
compressed representation of the sparse term-by-document matrix, see the section “The OUTPARENT=
Data Table” on page 61.
OUTPOS=CAS-libref.data-table
specifies the output data table to contain the position information about the child terms’ occurrences
in the document collection. CAS-libref.data-table is a two-level name, where CAS-libref refers to
the caslib and session identifier, and data-table specifies the name of the output data table. For more
information about this two-level name, see the DATA= option and the section “Using CAS Sessions
and CAS Engine Librefs” on page 37. For more information about this data table, see the section “The
OUTPOS= Data Table” on page 62.
OUTTERMS=CAS-libref.data-table
specifies the output data table to contain the summary information about the terms in the document
collection. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session
identifier, and data-table specifies the name of the output data table. For more information about
this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS Engine
Librefs” on page 37. For more information about this data table, see the section “Output Data Tables”
on page 60.
REDUCEF=n
removes terms that are not in at least n documents. The value of n must be a positive integer. By
default, REDUCEF=4.
SHOWDROPPEDTERMS
includes the terms that have a keep status of N in the OUTTERMS= data table and the OUTCHILD=
data table.
START=CAS-libref.data-table
specifies the input data table that contains the terms that are to be kept for the analysis. CAS-libref.data-
table is a two-level name, where CAS-libref refers to the caslib and session identifier, and data-table
specifies the name of the input data table. For more information about this two-level name, see the
DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37. These
terms are displayed in the OUTTERMS= data table with a keep status of Y. All other terms are
displayed with a keep status of N if the SHOWDROPPEDTERMS option is specified or not displayed
if the SHOWDROPPEDTERMS option is not specified. The START= data table must have a Term
variable and can also have a Role variable. You cannot specify both the START= and STOP= options.
STOP=CAS-libref.data-table
specifies the input data table that contains the terms to exclude from the analysis. CAS-libref.data-table
is a two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies
the name of the input data table. For more information about this two-level name, see the DATA=
option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37. These terms are
displayed in the OUTTERMS= data table with a keep status of N if the SHOWDROPPEDTERMS
option is specified. The terms are not identified as parents or children. The STOP= data table must
have a Term variable and can also have a Role variable. You cannot specify both the START= and
STOP= options.
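For example, a minimal stop list can be built and passed to the PARSE statement as follows. The stop terms chosen here and the names mycas.Documents, id, and text are arbitrary illustrations, not values from this documentation:

```
data mycas.StopList;
   length Term $25;
   input Term $;
   datalines;
the
and
of
;
run;

proc textmine data=mycas.Documents;
   doc_id id;
   var text;
   parse stop=mycas.StopList outterms=mycas.Terms;
run;
```

With SHOWDROPPEDTERMS also specified, the stop terms would appear in mycas.Terms with a keep status of N; otherwise they are omitted entirely.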
SYNONYM=CAS-libref.data-table
SYN=CAS-libref.data-table
specifies the input data table that contains user-defined synonyms to be used in the analysis. CAS-
libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier, and
data-table specifies the name of the input data table. For more information about this two-level name,
see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37.
The data table specifies parent-child relationships that enable you to map child terms to a representative
parent. The synonym relationship is indicated in the data table that is specified in the OUTTERMS=
option and is also reflected in the term-by-document data table that is specified in the OUTPARENT=
option. The input synonym data table must have either the two variables Term and Parent or the four
variables Term, Parent, Termrole, and Parentrole. This data table overrides any relationships that are
identified when terms are stemmed. (Terms are stemmed by default; you can suppress stemming by
specifying the NOSTEMMING option.)
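A sketch of the two-variable (Term and Parent) form of the synonym data table follows; the table name and the term pairs are hypothetical illustrations:

```
data mycas.Synonyms;
   infile datalines delimiter=',';
   length Term Parent $25;
   input Term $ Parent $;
   datalines;
automobile,car
auto,car
;
run;
```

Specifying SYNONYM=mycas.Synonyms in the PARSE statement then maps the child terms “automobile” and “auto” to the representative parent “car” in the OUTTERMS= and OUTPARENT= data tables.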
TERMWGT=ENTROPY | MI | NONE
specifies how terms are weighted. You can specify the following values:
ENTROPY weights terms by using the entropy formulation.
MI weights terms by using the mutual information formulation, which uses the
target variable that is specified in the TARGET statement.
NONE specifies that no term weight be applied.
For more information about the entropy formulation and the mutual information formulation for term
weighting, see the section “Term and Cell Weighting” on page 57.
SAVESTATE Statement
SAVESTATE RSTORE=CAS-libref.data-table ;
The SAVESTATE statement saves a text mining model to a binary object contained in a data table. The object
is referred to as the analytic store and contains the necessary information for scoring a text mining model by
the ASTORE procedure. Only complete text models consisting of both parsing and document projections can
be saved to the analytic store by the TEXTMINE procedure.
You must specify the following option:
RSTORE=CAS-libref.data-table
specifies a data table in which to save the text mining model. CAS-libref.data-table is a two-level name,
where CAS-libref refers to the caslib and session identifier, and data-table specifies the name of the
output data table. For more information about this two-level name, see the DATA= option and the
section “Using CAS Sessions and CAS Engine Librefs” on page 37.
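A minimal sketch of saving a complete model follows. The names mycas.Documents, id, text, and mycas.TextModel are assumptions for illustration. Because only complete models that include both parsing and document projections can be saved, both the PARSE and SVD statements are included:

```
proc textmine data=mycas.Documents;
   doc_id id;
   var text;
   parse outterms=mycas.Terms;
   svd k=10 outdocpro=mycas.DocPro;
   savestate rstore=mycas.TextModel;
run;
```

The mycas.TextModel analytic store can then be supplied to the ASTORE procedure to score new documents.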
SELECT Statement
SELECT label-list /< GROUP=group-option > KEEP | IGNORE ;
The SELECT statement enables you to specify the parts of speech, entities, or attributes that you want to
include in or exclude from your analysis. Exclusion by the SELECT statement is different from exclusion
that is indicated by the _keep variable in the OUTTERMS= data table. Terms that are excluded by the
SELECT statement cannot be included in the OUTTERMS= data table, whereas terms that have _keep=N
can be included in the OUTTERMS= data table if the SHOWDROPPEDTERMS option is specified. Terms
excluded by the SELECT statement are excluded from the OUTPOS= data table, but terms that have _keep=N
are included in the OUTPOS= data table. Table 3.3 summarizes the options you can specify in the SELECT
statement. The options are then described fully in syntactic order.
Option Description
label-list Specifies one or more labels of terms that are to be ignored or kept
in your analysis
GROUP= Specifies whether the labels are parts of speech, entities, or attributes
IGNORE Ignores terms whose labels are specified in the label-list
KEEP Keeps terms whose labels are specified in the label-list
You must specify a label-list and either the IGNORE or KEEP option:
label-list
specifies one or more labels that are parts of speech, entities, or attributes. Each label must
be surrounded by double quotation marks and separated by spaces from other labels. Labels are
case-insensitive. Terms that have these labels are either ignored during parsing (when the IGNORE
option is specified) or kept in the parsing results in the OUTPOS= and OUTTERMS= data tables
(when the KEEP option is specified). Table 3.5 shows all possible part-of-speech tags. Table 3.6 shows
all valid English entities. The attribute variable in Table 3.11 shows all possible attributes.
IGNORE
ignores during parsing all terms whose labels are specified in the label-list , but keeps all other terms in
the parsing results (the OUTPOS= and OUTTERMS= data tables).
KEEP
keeps in the parsing results (the OUTPOS= and OUTTERMS= data tables) only the terms whose labels
are specified in the label-list .
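A sketch that ignores purely numeric and punctuation terms by attribute follows. The labels "Num" and "Punct" are attribute values listed in Table 3.11; the GROUP= value "ATTRIBUTES" and the data table names are assumptions for illustration:

```
proc textmine data=mycas.Documents;
   doc_id id;
   var text;
   parse outterms=mycas.Terms;
   select "Num" "Punct" / group="ATTRIBUTES" ignore;
run;
```

Terms whose attribute is Num or Punct are then dropped during parsing and do not appear in the OUTPOS= or OUTTERMS= data tables.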
SVD Statement
SVD < svd-options > ;
The SVD statement specifies the options for calculating a truncated singular value decomposition (SVD) of
the large, sparse term-by-document matrix that is created during the parsing phase of PROC TEXTMINE.
Table 3.4 summarizes the svd-options in the statement by function. The svd-options are then described fully
in alphabetical order.
svd-option Description
Input Options
COL= Specifies the column variable, which contains the column indices
of the term-by-document matrix, which is stored in coordinate list
(COO) format
ROW= Specifies the row variable, which contains the row indices of the
term-by-document matrix, which is stored in COO format
ENTRY= Specifies the entry variable, which contains the entries of the term-
by-document matrix, which is stored in COO format
COL=variable
specifies the variable that contains the column indices of the term-by-document matrix. You must
specify this option when you run PROC TEXTMINE in SVD-only mode (that is, when you specify the
SVD statement but not the PARSE statement).
ENTRY=variable
specifies the variable that contains the entries of the term-by-document matrix. You must specify
this option when you run PROC TEXTMINE in SVD-only mode (that is, when you specify the SVD
statement but not the PARSE statement).
EXACTWEIGHT
requests that the weights aggregated during topic derivation not be rounded. By default, the calculated
weights are rounded to the nearest 0.001.
IN_TERMS=CAS-libref.data-table
specifies the input data table that contains information about the terms in the document collection.
CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier,
and data-table specifies the name of the input data table. For more information about this two-level
name, see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on
page 37. The data table should have the variables that are described in Table 3.11. The terms are
required to generate topic names in the OUTTOPICS= data table. This option is only for topic discovery
in SVD-only mode. This option conflicts with the PARSE statement, and only one of the two can be
specified. If you want to run SVD-only mode without topic discovery, then you do not need to specify
this option.
K=k
specifies the number of columns in the matrices U, V, and S. This value is the number of dimensions
of the data table after SVD is performed. If the value of k is too large, then the TEXTMINE procedure
runs for an unnecessarily long time. This option takes precedence over the MAX_K= option. This
option also controls the number of topics that are extracted from the text corpus when the ROTATION=
option is specified.
MAX_K=n
specifies the maximum value that the TEXTMINE procedure should return as the recommended value
of k (the number of columns in the matrices U, V, and S) when the RESOLUTION= option is specified
to recommend the value of k. This option is ignored if the K= option is specified, because the
TEXTMINE procedure then attempts to calculate exactly k dimensions (as opposed to recommending
a value) when it performs SVD. This option also controls the number of topics that are extracted from
the text corpus when the ROTATION= option is specified.
NOCUTOFFS
uses all weights in the U matrix to form the document projections. By default, when topics are
requested, weights below the term cutoff (as calculated in the OUTTOPICS= data table) are set to 0
before the projection is formed; this option suppresses that cutoff.
NUMLABELS=n
specifies the number of terms to use in the descriptive label for each topic. The descriptive label
provides a quick synopsis of the discovered topics. The labels are stored in the OUTTOPICS= data
table. By default, NUMLABELS=5.
OUTDOCPRO=CAS-libref.data-table <KEEPVARIABLES=variable-list><NONORMDOC>
OUTDOCPRO=CAS-libref.data-table <KEEPVARS=variable-list><NONORMDOC>
specifies the output data table to contain the projections of the columns of the term-by-document matrix
onto the columns of U. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib
and session identifier, and data-table specifies the name of the output data table. For more information
about this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS
Engine Librefs” on page 37. Because each column of the term-by-document matrix corresponds to
a document, the output forms a new representation of the input documents in a space that has much
lower dimensionality.
You can copy the variables from the data table that is specified in the DATA= option in the PROC
TEXTMINE statement to the data table that is specified in this option. You can specify the following
suboptions:
KEEPVARIABLES=variable-list
attaches the content of the variables that are specified in the variable-list to the output. These
variables must appear in the data table that is specified in the DATA= option in the PROC
TEXTMINE statement.
NONORMDOC
suppresses normalization of the columns that contain the projections of documents to have a unit
norm.
OUTTOPICS=CAS-libref.data-table
specifies the output data table to contain the topics that are discovered. CAS-libref.data-table is a
two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies
the name of the output data table. For more information about this two-level name, see the DATA=
option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37.
By default, RESOLUTION=HIGH.
ROTATION=VARIMAX | PROMAX
specifies the type of rotation to be used in order to maximize the explanatory power of each topic. You
can specify the following values:
PROMAX does an oblique rotation on the original left singular vectors and generates topics
that might be correlated.
VARIMAX does an orthogonal rotation on the original left singular vectors and generates
uncorrelated topics.
By default, ROTATION=VARIMAX.
ROW=variable
specifies the variable that contains the row indices of the term-by-document matrix. You must specify
this option when you run PROC TEXTMINE in SVD-only mode (that is, when you specify the SVD
statement but not the PARSE statement).
SVDS=CAS-libref.data-table
specifies the output data table to contain the calculated singular values. CAS-libref.data-table is a
two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies
the name of the output data table. For more information about this two-level name, see the DATA=
option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37.
SVDU=CAS-libref.data-table
specifies the data table to contain the calculated left singular vectors. CAS-libref.data-table is a two-
level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies the
name of the output data table. For more information about this two-level name, see the DATA= option
and the section “Using CAS Sessions and CAS Engine Librefs” on page 37.
SVDV=CAS-libref.data-table
specifies the data table to contain the calculated right singular vectors. CAS-libref.data-table is a
two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies
the name of the output data table. For more information about this two-level name, see the DATA=
option and the section “Using CAS Sessions and CAS Engine Librefs” on page 37.
TOL=ε
specifies the maximum allowable tolerance for the singular values. Let A be a matrix. Suppose σᵢ is
the ith singular value of A and vᵢ is the corresponding right singular vector. The SVD computation
terminates when, for all i ∈ {1, …, k}, σᵢ and vᵢ satisfy ‖AᵀAvᵢ − σᵢ²vᵢ‖₂ ≤ ε. The default value of ε is
10⁻⁶, which is more than adequate for most text mining problems.
TARGET Statement
TARGET variable ;
This statement specifies the variable that contains the information about the category that a document belongs
to. The target variable can be any nominal or ordinal variable; it is used in calculating mutual information
term weighting.
VARIABLES Statement
VARIABLES variable ;
VAR variable ;
This statement specifies the variable that contains the text to be processed.
Stemming
Stemming (a special case of morphological analysis) identifies the possible root form of an inflected word.
For example, the word “talk” is the stem of the words “talk,” “talks,” “talking,” and “talked.” In this case “talk”
is the parent, and “talk,” “talks,” “talking,” and “talked” are its children. The TEXTMINE procedure uses
dictionary-based stemming (also known as lemmatization), which unlike tail-chopping stemmers, produces
only valid words as stems. When part-of-speech tagging is on (that is, the NOTAGGING option is not
specified), the stem selection process restricts the stem to be of the same part-of-speech as the original term.
Part-of-Speech Tagging
Part-of-speech tagging uses SAS linguistic technologies to identify or disambiguate the grammatical category
of a word by analyzing it within its context. For example:
I like to bank at the local branch of my bank.
In this case, the first “bank” is tagged as a verb (V), and the second “bank” is tagged as a noun (N). Table 3.5
shows all possible part-of-speech tags.
Entity Identification
Entity identification uses SAS linguistic technologies to classify sequences of words into predefined classes.
These classes are assigned as roles for the corresponding sequences. For example, “nlpPerson,” “nlpPlace,”
“nlpOrganization,” and “nlpMeasure” are identified as classes for “George W. Bush,” “Boston,” “SAS Institute,”
and “2.5 inches,” respectively. Table 3.6 shows all valid entities for English. Not all languages support all entities.
Table 3.7 and Table 3.8 indicate the languages that are available for each entity.
Entities Description
nlpDate Date
nlpMeasure Measurement or measurement expression
nlpMoney Currency or currency expression
nlpNounGroup Phrases that contain multiple words
nlpOrganization Organization or company name
nlpPercent Percentage or percentage expression
nlpPerson Person’s name
nlpPlace Addresses, cities, states, and other locations
nlpTime Time or time expression
Language Support
Languages supported in the current release are Arabic, Chinese, Croatian, Czech, Danish, Dutch, English,
Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean,
Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Spanish, Swedish, Tagalog, Thai,
Turkish, and Vietnamese. By turning off some of the advanced parsing functionality, you might be able to
use PROC TEXTMINE effectively with other space-delimited languages.
The equation reduces the influence of highly frequent terms by applying the log function.
When the TERMWGT=ENTROPY option is specified, the following equation is used to weight terms:

   wᵢ = 1 + Σⱼ ( pᵢ,ⱼ log₂(pᵢ,ⱼ) ) / log₂(n)

In this equation, n is the number of documents, and pᵢ,ⱼ is the probability that term i appears in document j,
which can be estimated by pᵢ,ⱼ = fᵢ,ⱼ / gᵢ, where fᵢ,ⱼ is the frequency of term i in document j and gᵢ is the
global term frequency for term i.
When the TERMWGT=MI option is specified, the following equation is used to weight terms:

   wᵢ = max over Cₖ of log ( P(tᵢ, Cₖ) / ( P(tᵢ) P(Cₖ) ) )

In this equation, Cₖ is the set of documents that belong to category k, P(Cₖ) is the percentage of documents
that belong to category k, and P(tᵢ, Cₖ) is the percentage of documents that contain term tᵢ and belong to
category k. Let dᵢ be the number of documents that term i appears in. Then P(tᵢ) = dᵢ / n.
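As a quick numeric check of the entropy formulation, consider a term that appears exactly once in each of n = 2 documents, so its global frequency is 2 and its estimated probabilities are pᵢ,₁ = pᵢ,₂ = 1/2:

```latex
w_i = 1 + \frac{\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}}{\log_2 2}
    = 1 + \frac{-\tfrac{1}{2} - \tfrac{1}{2}}{1} = 0
```

A term that is spread evenly across all documents therefore receives a weight of 0, whereas a term that occurs in only one document receives the maximum weight of 1.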
Sparse Format
A matrix is sparse when most of its elements are 0. The term-by-document matrix that the TEXTMINE
procedure generates is a sparse matrix. To save storage space, the TEXTMINE procedure supports the COO
format for storing a sparse matrix.
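In COO (coordinate list) format, only the nonzero cells are stored, each as a (row, column, value) triple. As an illustration, a matrix in which term 2 appears 9 times in document 1 and term 5 appears 3 times in document 2 could be stored in a data table such as the following; the table and variable names are hypothetical, chosen only to show the shape of the data:

```
data mycas.TermDoc;
   input Term Document Count;
   datalines;
2 1 9
5 2 3
;
run;
```

Every cell that does not appear as a row in this table is implicitly 0.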
If U has a single column, then that column and the origin would form the best fit line, in a least squares sense, to the original document space. If U has two
columns, then these columns would form the best fit plane to the original document space. In general, the
first k columns of U form the best fit k-dimensional subspace for the document space. Thus, you can project
the columns of A onto the first k columns of U in order to optimally reduce the dimension of the document
space from m to k.
The projection of a document d (one column of A) onto U results in k real numbers that are defined by the
inner product of d with each column of U. That is, pᵢ = dᵀuᵢ. With this representation, each document forms
a k-dimensional vector that can be considered a theme in the document collection. You can then calculate
the Euclidean distance between each document and each column of U to determine the documents that are
described by this theme.
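The projection just described can be written compactly in matrix form, where A is the term-by-document matrix and U_k denotes its first k left singular vectors:

```latex
P = U_k^{\top} A , \qquad p_i = d^{\top} u_i
```

Each column of the k-by-n matrix P is the k-dimensional representation of one document d (the corresponding column of A).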
In a similar fashion, you can repeat the previous process by using the rows of A and the first k columns
of V. This generates a best fit k-dimensional subspace for the term space. This representation is used to
group terms into similar clusters. These clusters also represent concepts that are prevalent in the document
collection. Thus, singular value decomposition can be used to cluster both the terms and the documents into
meaningful representations of the entire document collection.
Computation
The computation of the singular value decomposition is fully parallelized in PROC TEXTMINE via
multithreading and distributed computing. Computing singular value decomposition is an iterative process
that involves considerable communication among the computer nodes in a distributed computing environment.
Therefore, adding more computer nodes for computing singular value decomposition might not always
improve efficiency. Conversely, when the data size is not large enough, adding too many computer nodes
for computation might lead to a noticeable increase in communication time and sometimes might even slow
down the overall computation.
SVD-Only Mode
If you run PROC TEXTMINE without a PARSE statement (called SVD-only mode), PROC TEXTMINE
directly takes the term-by-document matrix as input and computes singular value decomposition (SVD).
This functionality enables you to parse documents and compute the SVD separately in two procedure calls.
This approach is useful when you want to try different parameters for SVD computation after document
parsing. When you run PROC TEXTMINE in SVD-only mode, the DATA= option in the PROC TEXTMINE
statement names the data table that contains the term-by-document matrix.
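A sketch of an SVD-only call follows. The ROW=, COL=, and ENTRY= variable names shown here (Term, Document, Count) and the table names are assumptions for illustration; use the names that actually appear in your term-by-document data table:

```
proc textmine data=mycas.TermDoc;
   svd row=Term col=Document entry=Count
       k=10
       outdocpro=mycas.DocPro
       svds=mycas.SingularValues;
run;
```

Because no PARSE statement is specified, the DATA= data table is read directly as the term-by-document matrix in COO format.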
Topic Discovery
You can use the TEXTMINE procedure to discover topics that exist in your collection. In PROC TEXTMINE,
topics are calculated as a “rotation” of the SVD dimensions in order to maximize the sum of squares of the
term loadings in the V matrix. This rotation preserves the spatial information that the SVD provides, but it
also allows the newly rotated SVD dimensions to become semantically interpretable. Topics are characterized
by a set of weighted terms. Documents that contain many of these weighted terms are highly associated with
the topic, and documents that contain few of them are less associated with the topic. The term scores are
found in the U matrix that has been rotated to maximize the explanatory power of each topic. The columns
of the V matrix characterize the strength of the association of each document with each topic. Finally, the
TEXTMINE procedure can output a topic table that contains the best set of descriptor terms for each topic.
Because topic discovery is derived from the U matrix of SVD (each column of the U matrix is rotated and
corresponds to a topic), topic discovery options are specified in the SVD statement.
The term count of “said” in d1 is not attributed to its parent, “say.” The data table that is specified in the
OUTCHILD= option can be combined with the data table that is specified in the OUTTERMS= option to
construct the data table that is specified in the OUTPARENT= option.
When you specify the SHOWDROPPEDTERMS option in the PARSE statement, the data table saves all the
terms that appear in the data table that is specified in the OUTTERMS= option in the PARSE statement.
Variable Indicates
Language Source language of the documents
Stemming Whether stemming is used: “Y” indicates that stemming is used,
and “N” indicates that it is not used
Tagging Whether tagging is used: “Y” indicates that tagging is used, and
“N” indicates that it is not used
1 Kept terms are terms that are marked as kept in the data table specified in the OUTTERMS= option in the PARSE statement.
Variable Description
NG Whether noun grouping is used: “Y” indicates that noun grouping
is used, and “N” indicates that it is not used
Entities Whether entities should be extracted: “STD” indicates that entities
should be extracted, and “N” indicates that entities should not be
extracted. When the SELECT statement is specified, “K” indicates
that entities are kept, and “D” indicates that entities are ignored.
Multiterm The name of the multiterm SAS data table
Cellwgt How the cells of the term-by-document matrix are weighted
The term counts can be weighted, if requested. The data table saves only the terms that are marked as kept in
the data table that is specified in the OUTTERMS= option in the PARSE statement. In the data table, the
child frequencies are attributed to the corresponding parent. For example, assume that “said” has term ID t1
and appears eight times in document d1, “say” has term ID t2 and appears one time in document d1, “say”
is the parent of “said”, and neither cell weighting nor term weighting is applied. Then the data table that is
specified in the OUTPARENT= option will contain the following entry:
t2 d1 9
Variable Description
Term A lowercase version of the term
Role The term’s part of speech (this variable is empty if the NOTAGGING
option is specified in the PARSE statement)
Parent A lowercase version of the parent term
_Start_ The starting position of the term’s occurrence (the first position is
0)
_End_ The ending position of the term’s occurrence
Sentence The sentence where the occurrence appears
Paragraph The paragraph where the occurrence appears (this has not been
implemented in the current release, and the value is always set to 0)
Document The ID of the document where the occurrence appears
Target The value of the target variable that is associated with the document
ID if a variable is specified in the TARGET statement
If you exclude terms by specifying the IGNORE option in the SELECT statement, then those terms are
excluded from the OUTPOS= data table. No synonym lists, start lists, or stop lists are used when generating
the OUTPOS= data table.
Variable Description
Term A lowercase version of the term
Role The term’s part of speech (this variable is empty if the NOTAGGING
option is specified in the PARSE statement)
Attribute An indication of the characters that compose the term. Possible
attributes are as follows:
Alpha only alphabetic characters
Mixed a combination of attributes
Num only numbers
Punct punctuation characters
Entity an identified entity
Variable Description
Freq The frequency of a term in the entire document collection
Numdocs The number of documents that contain the term
_keep The keep status of the term: “Y” indicates that the term is kept for
analysis, and “N” indicates that the term should be dropped in later
stages of analysis. To ensure that the OUTTERMS= data table is
of a reasonable size, only terms that have _keep=Y are kept in the
OUTTERMS= data table by default.
Key The assigned term number (each unique term in the parsed documents
and each unique parent term has a unique Key value)
Parent The Key value of the term’s parent or a “.” (period):
Parent_id Another description of the term’s parent: Parent contains the parent’s
term number if a term is a child, but Parent_id contains this value for
all terms.
_ispar An indication of term’s status as a parent, child, or neither:
If you do not specify the SHOWDROPPEDTERMS option in the PARSE statement, this data table saves
only the terms that have _keep=Y. This helps ensure that the OUTTERMS= data table is of a reasonable size.
When you specify the SHOWDROPPEDTERMS option, the data table also saves terms that have _keep=N.
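Because the keep status is stored in the _keep variable, you can subset the OUTTERMS= data table by keep status after parsing. The following sketch assumes that the OUTTERMS= table was saved as mycas.outterms and that the SHOWDROPPEDTERMS option was specified in the PARSE statement; the local table name Dropped is illustrative:

data Dropped;
   set mycas.outterms;
   where _keep = 'N';
run;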
The weights for the terms and topics are contained in the V matrix, which is stored in the data table
that is specified in the SVDV= option in the SVD statement. The _name column contains the generated
topic name, which is the descriptive label for each topic and provides a synopsis of the discovered topics.
The generated topic name contains the terms that have the highest term loadings after the rotation has been
performed. The number of terms that are used in the generated name is determined by the NUMLABELS=
option in the SVD statement.
data mycas.CarNominations;
infile datalines delimiter='|' missover;
length text $70 ;
input text$ i;
datalines;
The Ford Taurus is the World Car of the Year. |1
Hyundai won the award last year. |2
Toyota sold the Toyota Tacoma in bright green. |3
The Ford Taurus is sold in all colors except for lime green. |4
The Honda Insight was World Car of the Year in 2008. |5
;
run;
The following statements run PROC TEXTMINE to parse the documents.
proc textmine data=mycas.CarNominations;
doc_id i;
var text;
parse
nostemming notagging nonoungroups
termwgt = none
cellwgt = none
reducef = 1
entities = none
outparent = mycas.outparent
outterms = mycas.outterms
outchild = mycas.outchild
outconfig = mycas.outconfig
;
run;
Output 3.1.1 shows the content of the mycas.outterms data table. In this example, stemming, part-of-speech
tagging, and noun group extraction are suppressed and NONE is specified for entity identification, term and
cell weighting, and term filtering. No synonym list, multiterm list, or stop list is specified. As a result of this
configuration, there is no child term in the mycas.outterms data table. Also, the mycas.outparent data table
and the mycas.outchild data table are exactly the same. The TEXTMINE procedure automatically drops
punctuation and numbers.
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 all Alpha 1 1 Y 1 . 1 1
2 toyota Alpha 2 1 Y 2 . 2 1
3 ford Alpha 2 2 Y 3 . 3 1
4 tacoma Alpha 1 1 Y 4 . 4 1
5 year Alpha 3 3 Y 5 . 5 1
6 taurus Alpha 2 2 Y 6 . 6 1
7 won Alpha 1 1 Y 7 . 7 1
8 honda Alpha 1 1 Y 8 . 8 1
9 bright Alpha 1 1 Y 9 . 9 1
10 sold Alpha 2 2 Y 10 . 10 1
11 colors Alpha 1 1 Y 11 . 11 1
12 lime Alpha 1 1 Y 12 . 12 1
13 except Alpha 1 1 Y 13 . 13 1
14 hyundai Alpha 1 1 Y 14 . 14 1
15 in Alpha 3 3 Y 15 . 15 1
16 is Alpha 2 2 Y 16 . 16 1
17 for Alpha 1 1 Y 17 . 17 1
18 world Alpha 2 2 Y 18 . 18 1
19 green Alpha 2 2 Y 19 . 19 1
20 the Alpha 8 5 Y 20 . 20 1
21 of Alpha 2 2 Y 21 . 21 1
22 award Alpha 1 1 Y 22 . 22 1
23 was Alpha 1 1 Y 23 . 23 1
24 car Alpha 2 2 Y 24 . 24 1
25 insight Alpha 1 1 Y 25 . 25 1
26 last Alpha 1 1 Y 26 . 26 1
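One way to examine an output table such as mycas.outterms is to copy it from the CAS server to the SAS client and print it, as in the following sketch (the local table name Terms is illustrative):

data Terms;
   set mycas.outterms;
run;
proc sort data=Terms;
   by Key;
run;
proc print data=Terms;
run;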
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 all Alpha 1 1 Y 1 . 1 1
2 win Alpha 1 1 Y 2 . 2 + 1
3 toyota Alpha 2 1 Y 3 . 3 1
4 ford Alpha 2 2 Y 4 . 4 1
5 tacoma Alpha 1 1 Y 5 . 5 1
6 year Alpha 3 3 Y 6 . 6 1
7 taurus Alpha 2 2 Y 7 . 7 1
8 won Alpha 1 1 Y 26 2 2 . 1
9 honda Alpha 1 1 Y 8 . 8 1
10 bright Alpha 1 1 Y 9 . 9 1
11 be Alpha 3 3 Y 10 . 10 + 1
12 sold Alpha 2 2 Y 27 11 11 . 1
13 sell Alpha 2 2 Y 11 . 11 + 1
14 colors Alpha 1 1 Y 28 23 23 . 1
15 lime Alpha 1 1 Y 12 . 12 1
16 except Alpha 1 1 Y 13 . 13 1
17 hyundai Alpha 1 1 Y 14 . 14 1
18 in Alpha 3 3 Y 15 . 15 1
19 is Alpha 2 2 Y 29 10 10 . 1
20 for Alpha 1 1 Y 16 . 16 1
21 world Alpha 2 2 Y 17 . 17 1
22 green Alpha 2 2 Y 18 . 18 1
23 the Alpha 8 5 Y 19 . 19 1
24 of Alpha 2 2 Y 20 . 20 1
25 award Alpha 1 1 Y 21 . 21 1
26 was Alpha 1 1 Y 30 10 10 . 1
27 car Alpha 2 2 Y 22 . 22 1
28 color Alpha 1 1 Y 23 . 23 + 1
29 insight Alpha 1 1 Y 24 . 24 1
30 last Alpha 1 1 Y 25 . 25 1
Output 3.3.1 The mycas.outterms Data Table with Noun Group Extraction and Entity Identification
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 all Alpha 1 1 Y 1 . 1 1
2 win Alpha 1 1 Y 2 . 2 + 1
3 tacoma Alpha 1 1 Y 3 . 3 1
4 year Alpha 3 3 Y 4 . 4 1
5 taurus Alpha 2 2 Y 5 . 5 1
6 lime green nlpNounGroup Alpha 1 1 Y 6 . 6 1
7 won Alpha 1 1 Y 30 2 2 . 1
8 honda Alpha 1 1 Y 7 . 7 1
9 bright Alpha 1 1 Y 8 . 8 1
10 in 2008 nlpDate Entity 1 1 Y 9 . 9 1
11 be Alpha 3 3 Y 10 . 10 + 1
12 sold Alpha 2 2 Y 31 12 12 . 1
13 bright green nlpNounGroup Alpha 1 1 Y 11 . 11 1
14 sell Alpha 2 2 Y 12 . 12 + 1
15 colors Alpha 1 1 Y 32 27 27 . 1
16 lime Alpha 1 1 Y 13 . 13 1
17 hyundai nlpOrganization Entity 1 1 Y 14 . 14 1
18 except Alpha 1 1 Y 15 . 15 1
19 in Alpha 3 3 Y 16 . 16 1
20 is Alpha 2 2 Y 33 10 10 . 1
21 toyota nlpOrganization Entity 2 1 Y 17 . 17 1
22 last year nlpDate Entity 1 1 Y 18 . 18 1
23 ford nlpOrganization Entity 2 2 Y 19 . 19 1
24 for Alpha 1 1 Y 20 . 20 1
25 world Alpha 2 2 Y 21 . 21 1
26 green Alpha 2 2 Y 22 . 22 1
27 the Alpha 8 5 Y 23 . 23 1
28 of Alpha 2 2 Y 24 . 24 1
29 award Alpha 1 1 Y 25 . 25 1
30 was Alpha 1 1 Y 34 10 10 . 1
31 car Alpha 2 2 Y 26 . 26 1
32 color Alpha 1 1 Y 27 . 27 + 1
33 insight Alpha 1 1 Y 28 . 28 1
34 last Alpha 1 1 Y 29 . 29 1
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 is V Alpha 2 2 Y 30 26 26 . 1
2 was V Alpha 1 1 Y 31 26 26 . 1
3 bright A Alpha 1 1 Y 1 . 1 1
4 taurus N Alpha 2 2 Y 2 . 2 1
5 won V Alpha 1 1 Y 32 17 17 . 1
6 for PPOS Alpha 1 1 Y 3 . 3 1
7 lime green nlpNounGroup Alpha 1 1 Y 4 . 4 1
8 lime A Alpha 1 1 Y 5 . 5 1
9 in 2008 nlpDate Entity 1 1 Y 6 . 6 1
10 the DET Alpha 8 5 Y 7 . 7 1
11 except V Alpha 1 1 Y 8 . 8 1
12 bright green nlpNounGroup Alpha 1 1 Y 9 . 9 1
13 color N Alpha 1 1 Y 10 . 10 + 1
14 in PPOS Alpha 3 3 Y 11 . 11 1
15 hyundai nlpOrganization Entity 1 1 Y 12 . 12 1
16 sold V Alpha 2 2 Y 33 28 28 . 1
17 toyota nlpOrganization Entity 2 1 Y 13 . 13 1
18 last year nlpDate Entity 1 1 Y 14 . 14 1
19 ford nlpOrganization Entity 2 2 Y 15 . 15 1
20 all A Alpha 1 1 Y 16 . 16 1
21 win V Alpha 1 1 Y 17 . 17 + 1
22 car PN Alpha 2 2 Y 18 . 18 1
23 colors N Alpha 1 1 Y 34 10 10 . 1
24 award N Alpha 1 1 Y 19 . 19 1
25 insight PN Alpha 1 1 Y 20 . 20 1
26 of PPOS Alpha 2 2 Y 21 . 21 1
27 honda PN Alpha 1 1 Y 22 . 22 1
28 world PN Alpha 2 2 Y 23 . 23 1
29 last A Alpha 1 1 Y 24 . 24 1
30 green N Alpha 2 2 Y 25 . 25 1
31 be V Alpha 3 3 Y 26 . 26 + 1
32 tacoma PN Alpha 1 1 Y 27 . 27 1
33 sell V Alpha 2 2 Y 28 . 28 + 1
34 year N Alpha 3 3 Y 29 . 29 1
Output 3.5.1 shows the content of the mycas.outterms data table. You can see that the term “insight” is
assigned the parent term “car”. Only the term “car” appears in the mycas.outparent data table.
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 is V Alpha 2 2 Y 28 25 25 . 1
2 was V Alpha 1 1 Y 29 25 25 . 1
3 bright A Alpha 1 1 Y 1 . 1 1
4 taurus N Alpha 2 2 Y 30 2 2 . 1
5 car N Alpha 4 4 Y 2 . 2 + 1
6 won V Alpha 1 1 Y 31 17 17 . 1
7 for PPOS Alpha 1 1 Y 3 . 3 1
8 lime green nlpNounGroup Alpha 1 1 Y 4 . 4 1
9 lime A Alpha 1 1 Y 5 . 5 1
10 in 2008 nlpDate Entity 1 1 Y 6 . 6 1
11 the DET Alpha 8 5 Y 7 . 7 1
12 except V Alpha 1 1 Y 8 . 8 1
13 bright green nlpNounGroup Alpha 1 1 Y 9 . 9 1
14 color N Alpha 1 1 Y 10 . 10 + 1
15 in PPOS Alpha 3 3 Y 11 . 11 1
16 hyundai nlpOrganization Entity 1 1 Y 12 . 12 1
17 sold V Alpha 2 2 Y 32 26 26 . 1
18 toyota nlpOrganization Entity 2 1 Y 13 . 13 1
19 last year nlpDate Entity 1 1 Y 14 . 14 1
20 ford nlpOrganization Entity 2 2 Y 15 . 15 1
21 all A Alpha 1 1 Y 16 . 16 1
22 win V Alpha 1 1 Y 17 . 17 + 1
23 car PN Alpha 2 2 Y 18 . 18 1
24 colors N Alpha 1 1 Y 33 10 10 . 1
25 award N Alpha 1 1 Y 19 . 19 1
26 insight PN Alpha 1 1 Y 34 2 2 . 1
27 of PPOS Alpha 2 2 Y 20 . 20 1
28 honda PN Alpha 1 1 Y 21 . 21 1
29 world PN Alpha 2 2 Y 22 . 22 1
30 last A Alpha 1 1 Y 23 . 23 1
31 green N Alpha 2 2 Y 24 . 24 1
32 be V Alpha 3 3 Y 25 . 25 + 1
33 tacoma PN Alpha 1 1 Y 35 2 2 . 1
34 sell V Alpha 2 2 Y 26 . 26 + 1
35 year N Alpha 3 3 Y 27 . 27 1
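In the mycas.outterms data table, each child term carries its parent’s Key value in the Parent variable, and its _ispar value is “.”. The following sketch lists only the child rows; the local table name Children is illustrative, and it assumes that _ispar is stored as a character variable, as the listing suggests:

data Children;
   set mycas.outterms;
   where _ispar = '.';
run;
proc print data=Children;
   var Term Role Key Parent;
run;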
data mycas.newStopList;
length Term $16 TermRole $16;
infile datalines delimiter=',';
input Term $ TermRole $;
datalines;
car, PN,
;
run;
Output 3.6.1 shows the content of the mycas.outterms data table. You can see that the term “car, PN” is not
in the mycas.outterms data table because that term and role were added to the custom stop list.
Output 3.6.1 The mycas.outterms Data Table Filtered Using Stop List
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 is V Alpha 2 2 Y 29 25 25 . 1
2 was V Alpha 1 1 Y 30 25 25 . 1
3 bright A Alpha 1 1 Y 1 . 1 1
4 taurus N Alpha 2 2 Y 2 . 2 1
5 won V Alpha 1 1 Y 31 17 17 . 1
6 for PPOS Alpha 1 1 Y 3 . 3 1
7 lime green nlpNounGroup Alpha 1 1 Y 4 . 4 1
8 lime A Alpha 1 1 Y 5 . 5 1
9 in 2008 nlpDate Entity 1 1 Y 6 . 6 1
10 the DET Alpha 8 5 Y 7 . 7 1
11 except V Alpha 1 1 Y 8 . 8 1
12 bright green nlpNounGroup Alpha 1 1 Y 9 . 9 1
13 color N Alpha 1 1 Y 10 . 10 + 1
14 in PPOS Alpha 3 3 Y 11 . 11 1
15 hyundai nlpOrganization Entity 1 1 Y 12 . 12 1
16 sold V Alpha 2 2 Y 32 27 27 . 1
17 toyota nlpOrganization Entity 2 1 Y 13 . 13 1
18 last year nlpDate Entity 1 1 Y 14 . 14 1
19 ford nlpOrganization Entity 2 2 Y 15 . 15 1
20 all A Alpha 1 1 Y 16 . 16 1
21 win V Alpha 1 1 Y 17 . 17 + 1
22 colors N Alpha 1 1 Y 33 10 10 . 1
23 award N Alpha 1 1 Y 18 . 18 1
24 insight PN Alpha 1 1 Y 19 . 19 1
25 of PPOS Alpha 2 2 Y 20 . 20 1
26 honda PN Alpha 1 1 Y 21 . 21 1
27 world PN Alpha 2 2 Y 22 . 22 1
28 last A Alpha 1 1 Y 23 . 23 1
29 green N Alpha 2 2 Y 24 . 24 1
30 be V Alpha 3 3 Y 25 . 25 + 1
31 tacoma PN Alpha 1 1 Y 26 . 26 1
32 sell V Alpha 2 2 Y 27 . 27 + 1
33 year N Alpha 3 3 Y 28 . 28 1
Output 3.7.1 shows the content of the mycas.outterms data table. In the preceding statements, “except for” is
defined as an individual term in the third DATA step. In the mycas.outterms data table, you can see that the
two terms “except” and “for” have become one term, “except for.”
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 is V Alpha 2 2 Y 29 25 25 . 1
2 was V Alpha 1 1 Y 30 25 25 . 1
3 bright A Alpha 1 1 Y 1 . 1 1
4 taurus N Alpha 2 2 Y 2 . 2 1
5 won V Alpha 1 1 Y 31 16 16 . 1
6 lime green nlpNounGroup Alpha 1 1 Y 3 . 3 1
7 except for PPOS Alpha 1 1 Y 4 . 4 1
8 lime A Alpha 1 1 Y 5 . 5 1
9 in 2008 nlpDate Entity 1 1 Y 6 . 6 1
10 the DET Alpha 8 5 Y 7 . 7 1
11 bright green nlpNounGroup Alpha 1 1 Y 8 . 8 1
12 color N Alpha 1 1 Y 9 . 9 + 1
13 in PPOS Alpha 3 3 Y 10 . 10 1
14 hyundai nlpOrganization Entity 1 1 Y 11 . 11 1
15 sold V Alpha 2 2 Y 32 27 27 . 1
16 toyota nlpOrganization Entity 2 1 Y 12 . 12 1
17 last year nlpDate Entity 1 1 Y 13 . 13 1
18 ford nlpOrganization Entity 2 2 Y 14 . 14 1
19 all A Alpha 1 1 Y 15 . 15 1
20 win V Alpha 1 1 Y 16 . 16 + 1
21 car PN Alpha 2 2 Y 17 . 17 1
22 colors N Alpha 1 1 Y 33 9 9 . 1
23 award N Alpha 1 1 Y 18 . 18 1
24 insight PN Alpha 1 1 Y 19 . 19 1
25 of PPOS Alpha 2 2 Y 20 . 20 1
26 honda PN Alpha 1 1 Y 21 . 21 1
27 world PN Alpha 2 2 Y 22 . 22 1
28 last A Alpha 1 1 Y 23 . 23 1
29 green N Alpha 2 2 Y 24 . 24 1
30 be V Alpha 3 3 Y 25 . 25 + 1
31 tacoma PN Alpha 1 1 Y 26 . 26 1
32 sell V Alpha 2 2 Y 27 . 27 + 1
33 year N Alpha 3 3 Y 28 . 28 1
Output 3.8.1 shows the content of the mycas.outterms data table. You can see that prepositions, determiners,
and proper nouns are excluded. Terms that are labeled “nlpDate” are also excluded.
Output 3.8.1 The mycas.outterms Data Table Ignoring Specified Parts of Speech and Entities
Obs Term Role Attribute Freq numdocs _keep Key Parent Parent_id _ispar Weight
1 is V Alpha 2 2 Y 19 16 16 . 1
2 was V Alpha 1 1 Y 20 16 16 . 1
3 bright A Alpha 1 1 Y 1 . 1 1
4 taurus N Alpha 2 2 Y 2 . 2 1
5 won V Alpha 1 1 Y 21 12 12 . 1
6 lime green nlpNounGroup Alpha 1 1 Y 3 . 3 1
7 lime A Alpha 1 1 Y 4 . 4 1
8 except V Alpha 1 1 Y 5 . 5 1
9 bright green nlpNounGroup Alpha 1 1 Y 6 . 6 1
10 color N Alpha 1 1 Y 7 . 7 + 1
11 hyundai nlpOrganization Entity 1 1 Y 8 . 8 1
12 sold V Alpha 2 2 Y 22 17 17 . 1
13 toyota nlpOrganization Entity 2 1 Y 9 . 9 1
14 ford nlpOrganization Entity 2 2 Y 10 . 10 1
15 all A Alpha 1 1 Y 11 . 11 1
16 win V Alpha 1 1 Y 12 . 12 + 1
17 colors N Alpha 1 1 Y 23 7 7 . 1
18 award N Alpha 1 1 Y 13 . 13 1
19 last A Alpha 1 1 Y 14 . 14 1
20 green N Alpha 2 2 Y 15 . 15 1
21 be V Alpha 3 3 Y 16 . 16 + 1
22 sell V Alpha 2 2 Y 17 . 17 + 1
23 year N Alpha 3 3 Y 18 . 18 1
Chapter 4
The TMSCORE Procedure
Contents
Overview: TMSCORE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
PROC TMSCORE Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Using CAS Sessions and CAS Engine Librefs . . . . . . . . . . . . . . . . . . . . . 82
Getting Started: TMSCORE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Syntax: TMSCORE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
PROC TMSCORE Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
DOC_ID Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
VARIABLES Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Details: TMSCORE Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
System Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Prerequisites for Running PROC TMSCORE . . . . . . . . . . . . . . . . . 88
Functionality that is related to document parsing, term-by-document matrix creation, and dimension
reduction is integrated into one procedure so that data can be processed more efficiently.
cas mysess;
libname mycas cas sessref=mysess;
The CAS statement creates the CAS session named mysess, and the LIBNAME statement creates the
mycas CAS engine libref that you use to connect to this session. It is not necessary to explicitly name the
CASHOST and CASPORT of the CAS server in the CAS statement, because these values are retrieved from
the corresponding SAS option values.
If you have created the mysess session, you can terminate it by using the TERMINATE option in the CAS
statement as follows:
cas mysess terminate;
For more information, see the sections “Using CAS Sessions and CAS Engine Librefs” on page 1 and
“Loading a SAS Data Set onto a CAS Server” on page 2 in Chapter 1, “Shared Concepts.”
The following DATA steps generate two data tables: the mycas.getstart data table contains 36 observations,
and the mycas.getstart_score data table contains 31 observations. Both data tables have two variables: the
text variable contains the input documents, and the did variable contains the ID of the documents. Each row
in each data table represents a “document” for analysis.
data mycas.getstart;
infile datalines delimiter='|' missover;
length text $150;
input text$ did;
datalines;
High-performance analytics hold the key to |1
unlocking the unprecedented business value of big data.|2
Organizations looking for optimal ways to gain insights|3
from big data in shorter reporting windows are turning to SAS.|4
As the gold-standard leader in business analytics |5
for more than 36 years,|6
SAS frees enterprises from the limitations of |7
traditional computing and enables them |8
to draw instant benefits from big data.|9
Faster Time to Insight.|10
From banking to retail to health care to insurance, |11
SAS is helping industries glean insights from data |12
that once took days or weeks in just hours, minutes or seconds.|13
It's all about getting to and analyzing relevant data faster.|14
Revealing previously unseen patterns, sentiments and relationships.|15
Identifying unknown risks.|16
And speeding the time to insights.|17
High-Performance Analytics from SAS Combining industry-leading |18
analytics software with high-performance computing technologies|19
produces fast and precise answers to unsolvable problems|20
and enables our customers to gain greater competitive advantage.|21
SAS In-Memory Analytics eliminate the need for disk-based processing|22
allowing for much faster analysis.|23
SAS In-Database executes analytic logic into the database itself |24
for improved agility and governance.|25
SAS Grid Computing creates a centrally managed,|26
shared environment for processing large jobs|27
and supporting a growing number of users efficiently.|28
Together, the components of this integrated, |29
supercharged platform are changing the decision-making landscape|30
and redefining how the world solves big data business problems.|31
Big data is a popular term used to describe the exponential growth,|32
availability and use of information,|33
both structured and unstructured.|34
Much has been written on the big data trend and how it can |35
serve as the basis for innovation, differentiation and growth.|36
;
run;
data mycas.getstart_score;
infile datalines delimiter='|' missover;
length text $150;
input text$ did;
datalines;
Big data according to SAS|1
At SAS, consider two other dimensions|2
when thinking about big data:|3
Variability. In addition to the|4
increasing velocities and varieties of data, data|5
flows can be highly inconsistent with periodic peaks.|6
Is something big trending in the social media?|7
Perhaps there is a high-profile IPO looming.|8
Maybe swimming with pigs in the Bahamas is suddenly|9
the must-do vacation activity. Daily, seasonal and|10
event-triggered peak data loads can be challenging|11
to manage - especially with social media involved.|12
Complexity. When you deal with huge volumes of data,|13
it comes from multiple sources. It is quite an|14
undertaking to link, match, cleanse and|15
transform data across systems. However,|16
it is necessary to connect and correlate|17
relationships, hierarchies and multiple data|18
linkages or your data can quickly spiral out of|19
control. Data governance can help you determine|20
how disparate data relates to common definitions|21
and how to systematically integrate structured|22
and unstructured data assets to produce|23
high-quality information that is useful,|24
appropriate and up-to-date.|25
Ultimately, regardless of the factors involved,|26
I believe that the term big data is relative|27
it applies (per Gartner's assessment)|28
whenever an organization's ability|29
to handle, store and analyze data|30
exceeds its current capacity.|31
;
run;
The following statements use PROC TEXTMINE for processing the input text data table mycas.getstart and
create three data tables (mycas.outconfig, mycas.outterms, and mycas.svdu), which can be used in PROC
TMSCORE for scoring:
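The training call itself is not reproduced here. A minimal sketch of a PROC TEXTMINE step that produces the three tables might look like the following; the K= value is illustrative and is not necessarily the value that was used to generate the output shown in this example:

proc textmine data=mycas.getstart;
   doc_id did;
   var text;
   parse
      outterms  = mycas.outterms
      outconfig = mycas.outconfig
      ;
   svd
      k    = 10
      svdu = mycas.svdu
      ;
run;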
The following statements then use PROC TMSCORE to score the input text data table mycas.getstart_score.
The statements take the three data tables that are generated by PROC TEXTMINE as input and create a
data table named mycas.docpro, which contains the projection of the documents in the input data table
mycas.getstart_score.
proc tmscore
data = mycas.getstart_score
terms = mycas.outterms
config = mycas.outconfig
svdu = mycas.svdu
svddocpro = mycas.docpro;
doc_id did;
variables text;
run;
data docpro;
set mycas.docpro;
run;
proc sort data=docpro;
by did;
run;
proc print data = docpro (obs=10);
run;
Figure 4.1 shows the output of PROC PRINT.
The PROC TMSCORE statement invokes the procedure. Table 4.1 summarizes the options in the statement
by function. The options are then described fully in alphabetical order.
Option Description
Basic Options
DATA= | DOC= Specifies the input document data table
TERMS= Specifies the data table that contains the terms to be used for scoring
CONFIG= Specifies the data table that contains the configuration information
SVDU= Specifies the data table that contains the U matrix whose columns are the left singular vectors
Output Options
OUTPARENT= Specifies the data table that contains the term-by-document frequency matrix that is used to model the document collection. In this matrix, the child terms are not represented, and the child terms’ frequencies are attributed to their corresponding parents.
SVDDOCPRO= Specifies the data table that contains the projections of the documents
DATA=CAS-libref.data-table
DOC=CAS-libref.data-table
names the input data table for PROC TMSCORE to use. CAS-libref.data-table is a two-level name,
where
CAS-libref refers to a collection of information that is defined in the LIBNAME statement and
includes the caslib, which includes a path to the data, and a session identifier, which
defaults to the active session but which can be explicitly defined in the LIBNAME
statement. For more information about CAS-libref , see the section “Using CAS
Sessions and CAS Engine Librefs” on page 82.
data-table specifies the name of the input data table.
The input data table contains documents for PROC TMSCORE to score. Each row of the input data
table must contain one text variable and one ID variable, which correspond to the text and the unique
ID of a document, respectively.
CONFIG=CAS-libref.data-table
specifies the input data table that contains configuration information for PROC TMSCORE. CAS-
libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier, and
data-table specifies the name of the input data table. For more information about this two-level name,
see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on page 82.
Specify the table that was generated by the OUTCONFIG= option in the PARSE statement of the
TEXTMINE procedure during training. For more information about this data table, see the section
“The OUTCONFIG= Data Table” on page 60 of Chapter 3, “The TEXTMINE Procedure.”
OUTPARENT=CAS-libref.data-table
specifies the output data table to contain a compressed representation of the sparse term-by-document
frequency matrix. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and
session identifier, and data-table specifies the name of the output data table. For more information
about this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS
Engine Librefs” on page 82. The data table contains only the kept representative terms, and the child
frequencies are attributed to the corresponding parent. For more information about the compressed
representation of the sparse term-by-document frequency matrix, see the section “The OUTPARENT=
Data Table” on page 61 of Chapter 3, “The TEXTMINE Procedure.”
SVDDOCPRO=CAS-libref.data-table
specifies the output data table to contain the reduced dimensional projections for each document.
CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and session identifier,
and data-table specifies the name of the output data table. For more information about this two-level
name, see the DATA= option and the section “Using CAS Sessions and CAS Engine Librefs” on
page 82. The contents of this data table are formed by multiplying the term-by-document frequency
matrix by the input data table that is specified in the SVDU= option and then normalizing the result.
SVDU=CAS-libref.data-table
specifies the input data table that contains the U matrix, which is created during training by PROC
TEXTMINE. CAS-libref.data-table is a two-level name, where CAS-libref refers to the caslib and
session identifier, and data-table specifies the name of the input data table. For more information about
this two-level name, see the DATA= option and the section “Using CAS Sessions and CAS Engine
Librefs” on page 82. The data table contains the information that is needed to project each document
into the reduced dimensional space. For more information about the contents of this data table, see the
SVDU= option in Chapter 3, “The TEXTMINE Procedure.”
TERMS=CAS-libref.data-table
specifies the input data table of terms to be used by PROC TMSCORE. CAS-libref.data-table is a
two-level name, where CAS-libref refers to the caslib and session identifier, and data-table specifies the
name of the input data table. For more information about this two-level name, see the DATA= option
and the section “Using CAS Sessions and CAS Engine Librefs” on page 82. Specify the table that was
generated by the OUTTERMS= option in the PARSE statement of the TEXTMINE procedure during
training. This data table conveys to PROC TMSCORE which terms should be used in the analysis
and whether they should be mapped to a parent. The data table also assigns to each term a key that
corresponds to the key that is used in the input data table that is specified by the SVDU= option. For
more information about this data table, see the section “The OUTTERMS= Data Table” on page 62 of
Chapter 3, “The TEXTMINE Procedure.”
DOC_ID Statement
DOC_ID variable ;
This statement specifies the variable that contains the ID of each document. The ID of each document must
be unique; it can be either a number or a string of characters.
VARIABLES Statement
VARIABLES variable ;
VAR variable ;
This statement specifies the variable that contains the text to be processed.
System Configuration
Prerequisites for Running PROC TMSCORE
To use the TMSCORE procedure, the language binary files that are provided under that license must be
available on the grid for parsing text.
Subject Index
options summary
PARSE statement, 43
PROC TEXTMINE statement, 42
PROC TMSCORE statement, 86
SELECT statement, 48
SVD statement, 49
sparse matrix
TEXTMINE procedure, 61
TEXTMINE procedure, 36
cell weight, 44
coordinate list (COO) format, 58
entity, 45
filtering term by frequency, 47
input data tables, 42
language used by input data tables, 43
multiterm words list, 45
noun groups, 45
number of threads, 43
show dropped terms, 47
sparse format, 58
sparse matrix, 61
start list, 47
stemming, 45
stop list, 47
SVD, singular value decomposition, 58
synonym list, 47
tagging, 45
term weight, 47
transactional style, 61
variable name style, 43
TMSCORE procedure, 81
input data tables, 86
system configuration, 88
TMSCORE procedure, system configuration
prerequisite, 88
transactional style
TEXTMINE procedure, 61
Syntax Index
BOOLRULE procedure, 11
  DOCINFO statement, 14
  OUTPUT statement, 15
  PROC BOOLRULE statement, 11
  SCORE statement, 16
  syntax, 11
  TERMINFO statement, 16
BOOLRULE procedure, DOCINFO statement, 14
  EVENTS= option, 14
  ID= option, 14
  TARGET= option, 15
  TARGETTYPE= option, 15
BOOLRULE procedure, OUTPUT statement, 15
  CANDIDATETERMS= option, 15
  RULES= option, 15
  RULETERMS= option, 16
BOOLRULE procedure, PROC BOOLRULE statement, 11
  DATA= option, 12
  DOC= option, 12
  DOCID= option, 13
  DOCINFO= option, 13
  GNEG= option, 13
  GPOS= option, 13
  MAXCANDIDATES= option, 13
  MAXCANDS= option, 13
  MAXTRIESIN= option, 13
  MAXTRIESOUT= option, 13
  MINSUPPORTS= option, 13
  MNEG= option, 14
  MPOS= option, 14
  TERMID= option, 14
  TERMINFO= option, 14
BOOLRULE procedure, SCORE statement, 16
  OUTMATCH= option, 16
  RULETERMS= option, 16
BOOLRULE procedure, TERMINFO statement, 16
  ID= option, 17
  LABEL= option, 17
CANDIDATETERMS= option
  OUTPUT statement, 15
CELLWGT= option
  PARSE statement, 44
COL= option
  SVD statement, 50
CONFIG= option
  TMSCORE statement, 87
DATA= option
  PROC BOOLRULE statement, 12
  PROC TEXTMINE statement, 42
  PROC TMSCORE statement, 86
DOC= option
  PROC BOOLRULE statement, 12
  PROC TMSCORE statement, 86
DOC_ID statement
  TEXTMINE procedure, 43
  TMSCORE procedure, 88
DOCID= option
  PROC BOOLRULE statement, 13
DOCINFO statement
  BOOLRULE procedure, 14
DOCINFO= option
  PROC BOOLRULE statement, 13
ENTITIES= option
  PARSE statement, 45
ENTRY= option
  SVD statement, 50
EVENTS= option
  DOCINFO statement, 14
EXACTWEIGHT option
  SVD statement, 50
GNEG= option
  PROC BOOLRULE statement, 13
GPOS= option
  PROC BOOLRULE statement, 13
GROUP= option
  SELECT statement, 49
ID= option
  DOCINFO statement, 14
  TERMINFO statement, 17
IGNORE option
  SELECT statement, 49
IN_TERMS= option
  SVD statement, 50
K= option
  SVD statement, 51
KEEP option
  SELECT statement, 49
KEEPVARS, KEEPVARIABLES
  SVD statement, 51
LABEL= option
TEXTMINE procedure, 53
TMSCORE procedure, 88
VARIABLES statement
TEXTMINE procedure, 53
TMSCORE procedure, 88
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are
trademarks of their respective companies. © 2013 SAS Institute Inc. All rights reserved. S107969US.0613