Course Guide
IBM BigInsights Text Analytics (v4)
Course code DW654 ERC 1.0
This material is meant for IBM Academic Initiative use only. NOT FOR RESALE
IBM Training
Preface
December 2015
NOTICES
This information was developed for products and services offered in the USA.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for
information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to
state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any
non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these
changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the
program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of
those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information
concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available
sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the
examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and
addresses used by an actual business enterprise is entirely coincidental.
TRADEMARKS
IBM, the IBM logo, ibm.com and BigInsights are trademarks or registered trademarks of International Business Machines Corp., registered in many
jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is
available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
Adobe, and the Adobe logo, are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
© Copyright International Business Machines Corporation 2015.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Preface
Contents
Course overview
Document conventions
Additional training resources
IBM product help
IBM BigInsights Text Analytics

Unit 1 Text Analytics Overview
Unit objectives
Overview of the IBM BigInsights units
Problem with unstructured data
Need to harvest unstructured data
Need for structured data
Design your project
Approach for text analytics
IBM BigInsights - Text Analytics
What's new?
Multilingual support
Demonstration 1: Extract education histories from biographies
Unit summary

Unit 2 Task Analysis
Unit objectives
Approach for text analytics
Task analysis
Select a data collection
Load the data collection
Identifying examples and clues
Demonstration 1: Finding and identifying clues
Unit summary

Unit 3 Annotation Query Language (AQL)
Unit objectives
AQL (1 of 2)
AQL (2 of 2)
AQL approach
Course overview
Preface overview
This course is designed to introduce the student to the capabilities of BigSheets.
BigSheets is a component of IBM BigInsights, available through the Analyst and Data
Scientist modules. It lets the analyst visualize and analyze data stored on HDFS
using a spreadsheet-style interface, without any programming.
Intended audience
The course is designed for business analysts who do not want to write code
to gain insight into their data.
Topics covered
Topics covered in this course include:
IBM BigInsights Text Analytics:
• Text Analytics overview
• Task analysis
• Annotation Query Language
• Candidate generation
• Filtering and consolidation
• Working with pre-built extractors
Course prerequisites
Participants should meet the following prerequisites:
• Familiarity with Hadoop and the Linux file system.
• Although not required, the DW644 - IBM BigInsights BigSheets course is
helpful for understanding how BigSheets can be used with Text Analytics.
• Many free courses are available at www.bigdatauniversity.com to help
students acquire the necessary prerequisites.
Document conventions
Conventions used in this guide follow Microsoft Windows application standards, where
applicable. As well, the following conventions are observed:
• Bold: Bold style is used in demonstration and exercise step-by-step solutions to
indicate a user interface element that is actively selected or text that must be
typed by the participant.
• Italic: Used to reference book titles.
• CAPITALIZATION: All file names, table names, column names, and folder names
appear in this guide exactly as they appear in the application.
To keep capitalization consistent with this guide, type text exactly as shown.
Task-oriented: You are working in the product and you need specific task-oriented help.
See: IBM Product - Help link
Unit 1 Text Analytics overview
Unit objectives
• Overview of the BigInsights module
• Compare structured vs unstructured data
• Understand how to design your project
• Describe and list the steps used for text analytics
Known usage
− Represents salary versus zip code
• Example:
In the text strings "The EPS was $1.64" and "The EPS is $ 1.64", the term
"EPS" is a keyword that identifies the dollar value in the sentence as
earnings per share. You can define an extractor as a sequence with the
literal "EPS" followed by a currency amount within one token or word.
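As a rough illustration of that sequence rule outside the tooling, the sketch below uses a Python regular expression. The names EPS_PATTERN and extract_eps are hypothetical, and the pattern only approximates an extractor: it assumes that a token is a run of letters or digits and allows at most one such token between the literal EPS and the dollar amount.

```python
import re

# Literal "EPS", at most one intervening word, then a dollar amount.
# The optional white space after "$" covers the "$ 1.64" variant.
EPS_PATTERN = re.compile(r"\bEPS\b(?:\s+\w+)?\s+\$\s?(\d+(?:\.\d+)?)")


def extract_eps(text):
    """Return the earnings-per-share amounts found in text as floats."""
    return [float(match.group(1)) for match in EPS_PATTERN.finditer(text)]


for sentence in ["The EPS was $1.64", "The EPS is $ 1.64", "Revenue was $1.64"]:
    print(sentence, "->", extract_eps(sentence))
```

The third sentence contains a dollar amount but no EPS keyword, so nothing is extracted from it. That is the point of using the keyword as a clue.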
Figure: Approach for text analytics. Steps shown: sample input documents, label snippets, find clues, develop extractors, test extractors, profile extractors, export extractors.
What's new?
Here are some of the new features of Text Analytics in the V4.1 release of IBM
BigInsights. You can export the results of your extractors to CSV, which
lets you continue your analysis with other tools. Project snapshots
were introduced in this release so that you can roll back your
extractors as needed. Each pre-built extractor can be customized per
project, which allows a lot of flexibility. There is now support for multiple languages and
English parts of speech (more on this on the next slide). There is now complete
support for scalar functions when creating columns. You can also
export your extractors as a BigSheets function. Finally, documents with
no file extension are now supported. Visit the knowledge center for the V4.1
release for more information about these new features.
Multilingual support
• Use of a multilingual tokenizer
Supports languages that do not use white space tokenization, such as Chinese
languages
Allows use of English parts of speech
Does not allow all of the pre-built extractors with other languages
• Set up in the Ambari Home Page
Under the Text Analytics service > Configs > Advanced ta-web-tooling-config,
type 'multilingual' or 'standard' to switch between the two
tokenizers
• Standard tokenization is much faster than multilingual
Multilingual support
This release of the Text Analytics tooling adds multilingual support: essentially,
support for languages that do not use white space tokenization, such as Chinese
languages. It also allows the use of English parts of speech. However, not
all of the pre-built extractors can be used with other languages. You have to use the
multilingual tokenizer for all other languages.
Standard tokenization is much faster than multilingual tokenization, so switch
to the multilingual tokenizer on the Ambari Home page only if you need it.
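To see why white space tokenization is not enough for a language such as Chinese, consider the small sketch below. It is not the BigInsights tokenizer; whitespace_tokens is a hypothetical helper that splits on white space only, and the sample strings are made up for illustration.

```python
def whitespace_tokens(text):
    """Split text on runs of white space, as a simple whitespace tokenizer would."""
    return text.split()


english = "IBM Watson technology"
chinese = "我喜欢文本分析"  # a short sentence written without spaces

print(whitespace_tokens(english))       # three separate tokens
print(len(whitespace_tokens(chinese)))  # the whole sentence is one token
```

Because the Chinese sentence comes back as a single token, dictionary and sequence rules that work token by token have nothing to match against. A tokenizer that understands the language has to find the word boundaries instead.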
Demonstration 1:
Extracting education histories from biographies
Purpose:
The purpose of this demonstration is to give you an end-to-end feel of how to
use BigInsights Text Analytics to analyze text data. In subsequent units and
demonstrations, you will get to work with the individual components to better
understand how to use Text Analytics.
User ids / Passwords
OS: biadmin/biadmin
Root: root/dalvm3
Ambari: admin/admin
BigInsights Home: guest/guest-password
4. Note the IP address that has been assigned. In the next few steps, you will
update the /etc/hosts file if the IP address listed there isn't the same as the one shown
in the output of ifconfig.
5. Switch to the root user using the password dalvm3. Type in:
su -
6. Use your favorite text editor to open the /etc/hosts file. This guide uses gedit.
Type in:
gedit /etc/hosts
7. Update the IP address to match the one listed when you ran the ifconfig
command.
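The edit in step 7 amounts to replacing the first field on the line that names your host. The sketch below shows that transformation on a string. The function name update_hosts_entry, the host name bivm.example.com, and both addresses are hypothetical, and the function does not write to /etc/hosts itself.

```python
def update_hosts_entry(hosts_text, hostname, new_ip):
    """Return hosts_text with the line that names hostname pointing at new_ip."""
    updated = []
    for line in hosts_text.splitlines():
        fields = line.split()
        # Leave blank lines and comments alone; only touch the line naming the host.
        if fields and not fields[0].startswith("#") and hostname in fields[1:]:
            line = " ".join([new_ip] + fields[1:])
        updated.append(line)
    return "\n".join(updated) + "\n"


sample = "127.0.0.1 localhost\n192.168.1.10 bivm.example.com bivm\n"
print(update_hosts_entry(sample, "bivm.example.com", "10.0.0.5"))
```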
• When all of these services have started, click the Knox service and
start the Demo LDAP service from its Service Action menu. You
need this service to authenticate into the BigInsights Home page.
16. Click the Text Analytics link to bring up the Web UI.
2. The file type is Text files. In our case, the files are located on the local
filesystem. Alternatively, you can load files that are residing on your Distributed
File System (DFS). Click Browse… to select your files.
3. Navigate to /home/biadmin/labfiles/topNLPResearchers_bios/
3. When the run finishes, you will see something similar to this:
2. Go ahead and collapse the Extractor Properties pane and resize the Results
pane to make more room. Click on the horizontal bar with the upside-down
triangle to toggle the expand/collapse function. The same bar can be used to
resize if you bring your mouse cursor over it and then click and drag to resize.
3. In the canvas, go ahead and delete all the other extractors. Keep the
Organization one. Select the ones you wish to delete and press the red x.
Alternatively, select and press the Delete key on your keyboard.
Note that you can drag and drop extractors along the canvas.
Task 6. Creating a dictionary of clues to search.
1. At the toolbar above the canvas, click the New Dictionary button.
2. On the canvas, the dictionary shows up. Name the dictionary Ed Clues.
4. To add a new term, click the green plus and add the clues.
Note a couple of things. To see all of the properties, I kept the Results pane
collapsed and resized the Extractor Properties pane.
3. As you may have guessed, you are going to filter out rows that do not contain
the clues in the dictionary. Click on the New Filter button.
4. Click Close to get out of that Warning dialog.
5. Edit the filter to: Include rows where Organization text contains dictionary
terms in Ed Clues (case sensitivity:Ignore Case).
For the purpose of showing the screenshot: I collapsed the panes to the left and
right of it. You can do the same if you need more room to edit the filter.
6. Before you run the extractor, restore (expand) the Results pane to see that
there are currently 78 rows from the original match. Run the extractor by
clicking the green play arrow.
7. Note that the results went from 78 to 30 rows because you are now
matching only on the terms in the Ed Clues dictionary. The rest were filtered out.
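The filter in steps 5 to 7 keeps a row only if its text contains a dictionary term, ignoring case. A minimal sketch of that idea follows. The dictionary terms and sample organization names are hypothetical (the guide does not list the actual Ed Clues terms), and the check is a plain substring match rather than the token-based match the tooling may use.

```python
ED_CLUES = ["university", "college", "institute"]  # hypothetical dictionary terms


def contains_dictionary_term(text, dictionary):
    """Return True if text contains any dictionary term, ignoring case."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in dictionary)


def filter_rows(rows, dictionary):
    """Keep only the rows whose text contains a dictionary term."""
    return [row for row in rows if contains_dictionary_term(row, dictionary)]


organizations = ["Stanford University", "IBM Research", "Boston COLLEGE", "ACM"]
print(filter_rows(organizations, ED_CLUES))
```

The row count drops for the same reason it drops in the demonstration: rows without a clue term are excluded from the result.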
4. Select the Degree.txt file from under /home/biadmin/labfiles and import it.
5. Create another dictionary called Major and import the Majors.txt file.
6. You should now have two new dictionaries created and loaded with terms:
Degree and Major.
Task 9. Creating proximity rules.
1. Create a Proximity rule. Click the New Proximity Rule button.
3. Create the same proximity rule again so that you have two proximity rules total.
Task 10. Creating a sequence of extractors.
1. On the canvas, arrange the extractors into a sequence using drag and drop.
Arrange the extractors in this order:
Degree, 1-5 tokens, Major, 1-5 tokens, Organization
When you drag an extractor or a rule next to another, a blue bar appears,
indicating that it will attach to that side when you let go of the mouse button.
2. When you have done it correctly, you will have created a new sequence:
3. With that sequence selected, run the sequence by clicking the green play
arrow on the toolbar.
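The sequence built in this task (Degree, 1-5 tokens, Major, 1-5 tokens, Organization) can be approximated with a regular expression. In the sketch below, the three term lists are hypothetical stand-ins for the Degree and Major dictionaries and the pre-built Organization extractor, and a token is assumed to be any run of non-white-space characters, which is simpler than the tokenizer the product uses.

```python
import re

DEGREES = ["Ph.D.", "M.S.", "B.S."]
MAJORS = ["Computer Science", "Linguistics"]
ORGANIZATIONS = ["Stanford University", "University of Michigan"]


def alternation(terms):
    """Build a regex group that matches any of the terms, longest first."""
    ordered = sorted(terms, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(term) for term in ordered) + ")"


# Proximity rule: one to five whitespace-delimited tokens, matched lazily.
GAP = r"(?:\s+\S+){1,5}?\s+"

SEQUENCE = re.compile(
    "(?P<degree>" + alternation(DEGREES) + ")" + GAP
    + "(?P<major>" + alternation(MAJORS) + ")" + GAP
    + "(?P<organization>" + alternation(ORGANIZATIONS) + ")"
)


def find_sequences(text):
    """Return one dict per Degree / Major / Organization sequence found in text."""
    return [match.groupdict() for match in SEQUENCE.finditer(text)]


print(find_sequences("She received her Ph.D. in Computer Science from Stanford University."))
print(find_sequences("He earned a Ph.D. in 1961 from the University of Michigan."))
```

The second sentence has a degree and an organization but no major, so this sequence does not match it.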
4. Once again, I resized the Results pane so that I can view the results. Do the same if you
need to in your environment.
Notice that the Results pane has four tabs. Three of the tabs show the
results from the individual extractors, and the fourth shows the sequence of
the three extractors with the proximity rules. Because I know the data well, and
this is a made-up demonstration scenario, I know that the sequence tab
is missing the University of Michigan sequence: "Ph.D in 1961 from
the University of Michigan".
5. Right-click on Sequence 1. Select Copy.
6. Right-click on the canvas and select Paste as New Copy.
Note: Paste as New Copy essentially clones the original extractor. If you use
the normal paste, any changes made to the source affect the copy as well.
7. In Sequence 1 copy 1, remove Major and one of the proximity rules by
dragging it out of the sequence. You can delete those two items. This is what it
should look like now.
2. Add a new NULL Span Column. Click the green plus button and select New
Column > NULL Span Column.
3. At this point, you should have four columns for Sequence 1 copy 1.
4. Back on the canvas, click on the Sequence 1 extractor and rename the
Sequence 1 column to Ed History (just as you had done for the Sequence 1
copy 1 extractor).
5. Now that both extractors have the same schema, drag and drop Sequence 1 to
align vertically with Sequence 1 copy 1 to create Union 1. The blue bar should
be at either the top or the bottom to indicate a union action.
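The reason for adding the NULL Span Column before the union is that both inputs must have the same columns. The sketch below shows the idea with plain Python dictionaries. The column names and sample rows are hypothetical, None stands in for the NULL span, and union_all is an illustrative helper rather than anything the tooling exposes.

```python
COLUMNS = {"ed_history", "degree", "major", "organization"}


def union_all(*relations):
    """Concatenate relations after checking that every row has the shared columns."""
    combined = []
    for relation in relations:
        for row in relation:
            if set(row) != COLUMNS:
                raise ValueError(f"schema mismatch: {sorted(row)}")
            combined.append(row)
    return combined


sequence_1 = [{
    "ed_history": "Ph.D. in Computer Science from Stanford University",
    "degree": "Ph.D.",
    "major": "Computer Science",
    "organization": "Stanford University",
}]
# The copy has no Major extractor, so the major column is filled with a null value.
sequence_1_copy_1 = [{
    "ed_history": "Ph.D in 1961 from the University of Michigan",
    "degree": "Ph.D",
    "major": None,
    "organization": "University of Michigan",
}]

print(len(union_all(sequence_1, sequence_1_copy_1)))
```

Without the placeholder column, the second relation would fail the schema check and could not be combined with the first.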
10. Run the Education History extractor again. Note that the number of returned
rows went from 11 to 7.
3. In the Results view, click the export button to export the results to CSV format.
4. Select the results, specify your options, choose your location, and click OK.
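Once the results are exported, any tool that reads CSV can continue the analysis. As a sketch, the Python below counts rows per organization. The column names and rows are hypothetical, since the layout of the real export depends on your extractor's output columns, and an in-memory string stands in for the exported file.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value of the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Stand-in for an exported results file; the header names are assumptions.
sample = io.StringIO(
    "Degree,Organization\n"
    "Ph.D.,University of Michigan\n"
    "M.S.,Stanford University\n"
    "B.S.,University of Michigan\n"
)
counts = count_by_column(sample, "Organization")
print(counts.most_common())
```

To read a real export, pass a file opened with open(path, newline="") instead of the StringIO object.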
2. Specify the data source, output folder, and the extractor to run.
Unit summary
• Overview of the BigInsights module
• Compare structured vs unstructured data
• Understand how to design your project
• Describe and list the steps used for text analytics
Task Analysis
Unit 2 Task analysis
Unit objectives
Figure: Approach for text analytics. Steps shown: sample input documents, label snippets, find clues, develop extractors, test extractors, profile extractors, export extractors.
Task analysis
• Collect sample documents
For example, IBM quarterly earnings reports from 2006 to 2010
Task analysis
There are two steps to this phase. The first is to collect sample documents from the
dataset you wish to analyze. The second is to examine these documents manually
to find clues that will help you find what you need.
Demonstration 1:
Finding and identifying clues
Purpose:
This demonstration will show you how to find and identify clues that are
needed for the extractor. In real life, this process would typically be done with
assistance from a subject matter expert, or someone who is familiar with the
documents that you are examining. Prior to starting this demonstration,
ensure that all the necessary Ambari services are up. If you have just
completed Demonstration 1 in Unit 1, you are in good shape. Otherwise, refer to
that demonstration to set them up.
User ids / Passwords
OS: biadmin/biadmin
Root: root/dalvm3
Ambari: admin/admin
BigInsights Home: guest/guest-password
5. Click on the Extractor tab to see the list of the pre-built and custom-built
extractors. You can drag and drop these directly onto the canvas to start using
them.
6. On the canvas, select the Degree extractor.
7. Expand the Extractor Properties pane to see its settings. You may need to
resize the pane by clicking and dragging it. Experiment with this to get comfortable
resizing the panes.
Note: You can only resize a pane if it is expanded.
8. Under the Extractor Properties, there are three sub-tabs: General, Settings,
and Output. Click the General tab (if it isn't already selected).
9. On the General tab, you can edit the name, provide a description, or define
tags that make the extractor easier to find in the Extractor
catalogs. We will not change anything here; this is just for your information.
10. Click on the Settings tab. On here, you can modify the terms in the dictionary
(in this case) or if it was a different extractor, modify the settings of that one.
11. Click on the Output tab. Here is where you can specify the columns from the
extractor.
12. On the canvas, click on the Education History extractor and run it.
13. Go ahead and collapse the Extractor Properties and expand and resize the
Results pane so that it is more visible.
14. Each tab on the Results pane comes from a single extractor. In our case, we
have a single union of multiple extractors, so we have a single tab. Within that
one tab, however, we have multiple results, one for each of the extractors that
make up that union. Examine the results to see the various columns.
15. Click on any row and you will see that the results are highlighted within the
document on the Documents pane (on the right).
16. Remember, you have the option to export your results as a CSV file for further
analysis with a different tool.
17. On the Documents pane, you can toggle between single document view and
multiple document views. Go ahead and click on it to see it in action.
18. Next to that is another button, Show Extractor Name. This is a nice little
feature that tells you which extractor found the results. For example, select one
of the rows from the Results pane.
19. Now click the Show Extractor Name button to see which extractor it was:
Obviously, in this case, we only had one extractor, but if you ran with multiple
extractors, you can use this to find out which one captured that result. This can
help with debugging if you end up finding terms that should or shouldn't be part
of the result set.
20. Finally, the third button is the Remove tag / Remap tag. This is used for
documents where you may have tags, such as XML documents.
21. If you need additional help, at the upper right corner, there is a dropdown icon.
Click on that and you can visit the help section for Text Analytics.
2. Specify the file type as Text files and the file location as Local files. Click
Browse to select the files.
3. Navigate to /home/biadmin/labfiles/WatsonData/Data/.
4. Select all the files. Use CTRL + A to select all the files and click Open.
2. Next, make the text easier to read: remove the tags by clicking the Remove
tag icon.
The first reference to Watson in the text was related to a competition. The
second reference was IBM Watson technology. This is a reference that interests
us. There are two clues of value here: IBM and technology.
It is the word Watson in context with these clue words that allows us to
infer the meaning of the word Watson as used here.
Positive clues: Watson, IBM, Technology
6. Locate the SM010.txt file.
7. Examine the file and take note of the words Solutions and computer. These
clues also relate to the Watson technology and will help the computer figure
out whether the Watson within the document is the Watson we want.
Positive clues: Watson, IBM, Technology, Solutions, Computer
8. Locate the SM005.txt file.
9. Examine this file and take note of the word System.
Positive clues: Watson, IBM, Technology, Solutions, Computer, System
10. Locate the SM011.txt file.
11. Examine the document and take note of the word Jeopardy.
Positive clues: Watson, IBM, Technology, Solutions, Computer, System,
Jeopardy
12. Locate the SM063.txt file.
13. Here we will look for some negative clues, or clues that may give false positives
(for example, returning Watson where it has nothing to do with technology
but is instead a person's name or something of that nature).
False positive clues: Todd Watson
14. Locate the SM121.txt file. It's on page 25 if you are searching by page number.
15. In this file you have Watson Research Center in Yorktown Heights. Research
would be another good false positive clue:
False positive clues: Todd Watson, Research
16. At this point, we have enough information to work with to demonstrate the
capability of BigInsights Text Analytics.
Results:
At the end of this demo, you should be able to identify clues that are needed
for the extractors. You understand that typically, this process would involve
someone who is familiar with the documents, such as a subject matter expert.
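One way to picture how the clue lists from this demonstration might be combined is the sketch below. It is only an illustration and not how the extractors in later units are built: the helper names are hypothetical, the sample sentences are invented, and the rule is deliberately simple (a Watson mention counts only if a positive clue is present and no false positive clue is).

```python
import re

POSITIVE_CLUES = ["IBM", "Technology", "Solutions", "Computer", "System", "Jeopardy"]
FALSE_POSITIVE_CLUES = ["Todd Watson", "Research"]


def has_clue(text, clues):
    """Return True if any clue appears in text as a whole word, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(clue) + r"\b", text, re.IGNORECASE)
        for clue in clues
    )


def is_watson_technology(sentence):
    """Apply the positive and false positive clues to one sentence."""
    if not has_clue(sentence, ["Watson"]):
        return False
    if has_clue(sentence, FALSE_POSITIVE_CLUES):
        return False
    return has_clue(sentence, POSITIVE_CLUES)


for sentence in [
    "IBM Watson technology won the competition.",
    "Todd Watson wrote about IBM.",
    "The Watson Research Center is in Yorktown Heights.",
]:
    print(is_watson_technology(sentence), "-", sentence)
```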
Unit summary
Unit 3 Annotation Query Language
Unit objectives
• Describe the AQL data model
• List the AQL components
• List the AQL objects that are used to create basic features
• Describe the Information Extraction Web Tool
• Describe the categories of the pre-built extractors
AQL (1 of 2)
• Data model
Similar to the standard relational model
You work with views
Data is stored in tuples
Tuples have attributes
• Scalar types
Integer – 32-bit signed integer
Float – Single precision floating-point number
Text – Unicode string
− Has additional metadata to indicate to which tuple the string belongs
Span – Contiguous region of characters in a text object
List – Represents a bag of values of type (Integer, Float, Text, or Span)
AQL (1 of 2)
The unit will cover a bit about how the underlying AQL operates.
The data model for AQL is similar to the standard relational model used by SQL
databases. To extract data, you create views. These are very similar to a table in a
relational system. A view forms a relation. All data in AQL is stored in tuples or what
you might think of as rows. Each tuple is make up of attributes, essentially columns
in relational tables. All tuples in a relation must have the same schema, meaning the
name and type of each attribute must be the same for all tuples in that relation.
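The data model can be pictured with a small sketch. This is Python, not AQL, and the names Span, NumberTuple, and number_view are hypothetical; it only illustrates that a view is a set of tuples sharing one schema, and that a span is a region of characters in a text object.

```python
import re
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class Span:
    """A contiguous region of characters in a text object."""
    text: str   # the text object the span points into
    begin: int  # offset of the first character
    end: int    # offset one past the last character

    def covered_text(self) -> str:
        return self.text[self.begin:self.end]


class NumberTuple(NamedTuple):
    """Schema of a hypothetical view: every tuple has the same attributes."""
    match: Span
    value: int


doc = "Revenue grew 12 percent to 340 million."

# A view is modeled here as a list of tuples that all share one schema.
number_view = [
    NumberTuple(Span(doc, m.start(), m.end()), int(m.group()))
    for m in re.finditer(r"\d+", doc)
]
covered = [t.match.covered_text() for t in number_view]
```

Each tuple in number_view has the same two attributes, which mirrors the rule that all tuples in a relation share a schema.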
AQL (2 of 2)
• Execution model
An AQL extractor consists of a collection of views
− Each view defines a relation
• Reuse is implemented via the export and import statements
AQL (2 of 2)
An AQL extractor consists of a collection of views, each of which defines a relation.
The text analytics tooling within BigInsights makes it easy for you by keeping all of
this under the covers. In fact, you will be working mainly with extractors through the
web UI.
AQL approach
[Figure: AQL components arranged in three layers, from bottom to top]
1. Basic features: Dictionary, Regex, Split, Part of speech
2. Candidate generation: Patterns, Blocks, Select, Union all
3. Consolidation and filtering: Consolidation, Set filtering, Predicate filtering
AQL approach
Creating AQL extractors is a multi-step, multi-layered process. You first start by
creating fundamental components that are very specific in nature using the basic
features of the language. These are along the lines of finding numeric strings in the
data or locating all of the division names in the document.
These basic features are then used for candidate generation. At this level in the
process you might be using multiple basic features in order to find occurrences of
amounts, such as $1.4 billion. And then using that data to find the amounts that are
associated with particular divisions. One of the things that you might find when
generating candidates is that you extract more data than you require.
The third step is to then consolidate or filter the candidate results so that you only
extract the desired data. The next couple of units cover each of these steps, starting
with the basic features step in this unit.
Creating dictionaries
• Dictionaries can be created from
External dictionary files
Inline dictionary declarations
New Dictionary
Import file
Add terms
Creating dictionaries
Click the New Dictionary button on the toolbar. Then you specify the name of the
dictionary. You then have one of two ways to load that dictionary. Either by directly
typing in each term or specifying a file to load into the dictionary – this is all done
through the Extractor Properties.
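What a dictionary extractor does can be approximated in a few lines. The sketch below is Python, not the web tool's implementation. The function names are hypothetical, the one-term-per-line file layout is an assumption, and case-insensitive whole-word matching is assumed as the default behavior.

```python
import re


def parse_terms(raw):
    """Read one term per line, skipping blank lines (assumed file layout)."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def dictionary_matches(terms, text):
    """Return (begin, end, matched_text) for each case-insensitive whole-word hit."""
    # Longest terms first so that multi-word entries win over shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, text, re.IGNORECASE)]


inline_terms = ["Watson"]                         # terms typed in directly
file_terms = parse_terms("Jeopardy\n\nDeepQA\n")  # terms as they might come from a file
sample = "IBM Watson won Jeopardy. watson builds on DeepQA."
hits = dictionary_matches(inline_terms + file_terms, sample)
```

Either way of loading terms ends in the same list, so the matching step does not need to know where a term came from.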
Regular expression
• Add a new regular expression in the Extractor Properties
Regular expression
You create a regular expression extractor in the Extractor Properties. You enter your
own regular expression, and the tool extracts the text that matches it.
Pre-built extractors (1 of 2)
Category – Use
Finance Actions – Extractors that identify and extract information about corporate
financial activities, such as acquisitions and mergers or earnings reports and the
parties involved.
Named Entity Recognition – Extractors that identify and extract information about
people, locations, organizations, and contact methods.
Generic – Extractors that generally extract information on the basis of a single word
or number, such as a capitalized word or a currency amount.
Pre-built extractors (1 of 2)
There are pre-built extractors included with the BigInsights Text Analytics tooling.
The extractors fall into five categories today. These categories are listed
across two slides.
There are finance-related extractors that extract information about corporate
financial activities.
There are named entity recognition extractors that extract information about people,
locations, organizations, and contact methods.
There are generic extractors that extract basic information such as a single word or
number, or a currency amount.
Pre-built extractors (2 of 2)
Category – Use
Machine Data Analytics – Extractors that parse and extract information from log files,
including Hadoop log files.
Sentiment Analysis – Extractors that use deep natural language processing to infer the
sentiment being expressed.
Pre-built extractors (2 of 2)
There is a category for machine data analytics with extractors that parse and extract
from log files, including Hadoop log files.
There is a sentiment analysis category for extractors that use deep natural language
processing to infer sentiment.
You will work with some of these in a later lab demo.
[Figure: extractor development workflow. Sample input documents, Label snippets
(find clues), Develop extractors, Test extractors, Profile extractors, Export extractors]
Demonstration 1
Creating dictionaries for your Watson project
Demonstration 1:
Working with basic AQL features using the Web UI
Purpose:
The purpose of the demo is to get you started with the extractors using the
Web UI. You will build dictionaries using the clues that you identified in the
previous demo. Everything is done by dragging and dropping the extractors
onto the canvas.
User ids / Passwords
OS: biadmin/biadmin
Root: root/dalvm3
Ambari: admin/admin
BigInsights Home: guest/guest-password
5. It should return 298 rows where the word Watson was found within the
documents.
If you do not get 298 rows, it's probably because you didn't remove the tags
when you loaded the document. Remove the tags by clicking on the Remove
tag button:
Results:
You should be able to build dictionaries using the clues that you identified in
the previous example.
Unit summary
• Describe the AQL data model
• List the AQL components
• List the AQL objects that are used to create basic features
• Describe the Information Extraction Web Tool
Unit summary
Unit 4 Candidate generation
Unit objectives
• Understand the general guidelines for developing extractors
• List the AQL candidate generation components
• Explain the use of sequence patterns, proximity rules and union
Unit objectives
• Identify candidates
• Complete occurrences of the target extraction object by combining
the basic features identified in Step 1
• Write different rules for different combination patterns of basic
features.
• Don’t worry too much about false positive mistakes (we will handle
them in Step 3)
Candidate rules
• Used to create more sophisticated views by building on basic feature
views
• Blocks
Used to identify blocks of contiguous spans in a document
• Sequence patterns
Used to perform pattern matching
Can employ previously extracted spans
• Union all
Combines tuples from two views that have the same schema
• Select
Used to construct and combine sets of tuples based on various
specifications
Candidate rules
Remember that all these can be done through custom AQL coding, but in our web
UI, you have the ability to use the canvas and not have to worry about coding any
AQL.
Basic features return individual building block components. You can think of each
object created using a basic feature as being a brick when building a house. By
itself, it is of very little use. But when combined with other bricks, you can then have
a wall of the house which is a much more significant component of the house.
Candidate rules allow you to combine basic feature objects to create more
sophisticated views.
For example, finding all of the occurrences of the word million in your document, in
and of itself, does not help very much. The same is true for finding all numbers. But
being able to find a number that is followed by the word million now becomes more
helpful.
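The number-followed-by-million example can be sketched as two basic features combined by one candidate rule. This is Python rather than AQL, and the helper name follows is hypothetical. The sketch only shows how separately extracted spans are joined into a more useful candidate.

```python
import re

text = "Sales were 12 million, up from 9 million. A million thanks to 300 staff."

# Basic features: every number and every occurrence of the word million.
numbers = [(m.start(), m.end()) for m in re.finditer(r"\b\d+(?:\.\d+)?\b", text)]
millions = [(m.start(), m.end()) for m in re.finditer(r"\bmillion\b", text)]


def follows(left, right, source):
    """True when `right` starts after `left` with only whitespace in between."""
    return left[1] <= right[0] and source[left[1]:right[0]].strip() == ""


# Candidate rule: a number immediately followed by the word million.
candidates = [text[n[0]:m[1]]
              for n in numbers for m in millions if follows(n, m, text)]
```

Neither "300" nor "A million" becomes a candidate, because each one satisfies only one of the two basic features.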
Sequence patterns
• Assess the text for patterns that provide context for terms of interest
• To define a sequence pattern:
Create the individual extractors for all needed terms
Drag and drop one extractor onto another to form a sequence
• Example:
Create a dictionary called Military Ranks that includes terms such as
Warrant Officer, Sergeant, and Lieutenant.
Drag the Person extractor onto the canvas following the Military Ranks
dictionary to indicate that a new sequence finds ranks then names.
Sequence patterns
1. Create individual extractors for all needed terms by extending provided
extractors, or creating dictionaries, regular expressions, literals, and so forth.
2. Drag and drop one extractor to another extractor on the canvas, aligning your
cursor to reflect the order in which the term appears in the text pattern. A dark,
bold blue line to the left or right of the extractor on which you are dropping the
new extractor indicates the relative positions of the extractors. After you drop
the new extractor, a box surrounds the two extractors to indicate the
sequence. The box has a temporary title, Sequence n.
3. Optional: Select the sequence on the canvas and rename it in Extractor
Properties under General.
4. Optional: If needed, repeat steps 1 and 2 to add additional elements to the
pattern.
Proximity Rule
• Special type of element in a sequence pattern
• Each word or character is referred to as a token
• Specify the minimum and maximum number of tokens that might occur between the
desired terms
• Example:
Create a dictionary called Clerical title that includes terms such as Rabbi,
Father, and Archbishop
Drag the Person extractor to the right side of the Clerical title dictionary.
Right-click on Clerical title and click Add After > Proximity Rule. To
capture terms such as Archbishop of Canterbury, Robert Runcie, specify the
minimum and maximum number of tokens between words, in this case 0-5
Proximity Rule
Proximity Rule is a special type of element in a sequence pattern. It allows you to
specify the minimum and maximum number of tokens that might occur between the
desired terms. For example, suppose you want to extract clerical titles and the
person to whom each title belongs.
You create a dictionary called Clerical title that includes terms such as Rabbi,
Father, and Archbishop. Drag the Person extractor to the right side of the Clerical
title dictionary and let go when you see the blue bar. This will create a sequence of
these two entities. Right-click on the Clerical title and click Add After > Proximity
Rule to indicate that 0 to 5 tokens can occur between the title and the name. In
this example, you want to be able to extract Archbishop of Canterbury, Robert
Runcie.
As a second example, select tweets that mention the Twitter names of industry analysts
together with a big data term. To accomplish this, create two dictionaries, one of Twitter
names of analysts and a second of big data terms, and combine them on the
workspace canvas with a proximity of one to 25 tokens.
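A proximity rule can be sketched as a bound on the number of tokens between two matched elements. The Python below is an illustration only: the tokenizer (words and single punctuation marks) is an assumption about how tokens are counted, the people list is a stand-in for the Person extractor, and all function names are hypothetical.

```python
import re


def tokenize(text):
    """Split into word tokens and single punctuation tokens (assumed tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_phrase(tokens, phrase):
    """Yield (start, end) token indexes where the phrase occurs."""
    words = tokenize(phrase)
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            yield i, i + len(words)


def proximity_sequence(text, left_terms, right_terms, min_gap, max_gap):
    """Match a left term, then min_gap..max_gap tokens, then a right term."""
    tokens = tokenize(text)
    results = []
    for left in left_terms:
        for l_start, l_end in find_phrase(tokens, left):
            for right in right_terms:
                for r_start, r_end in find_phrase(tokens, right):
                    gap = r_start - l_end  # tokens between the two elements
                    if min_gap <= gap <= max_gap:
                        results.append(" ".join(tokens[l_start:r_end]))
    return results


titles = ["Rabbi", "Father", "Archbishop"]
people = ["Robert Runcie"]  # stand-in for the Person extractor
sentence = "The Archbishop of Canterbury, Robert Runcie, spoke first."
matches = proximity_sequence(sentence, titles, people, 0, 5)
too_tight = proximity_sequence(sentence, titles, people, 0, 2)
```

Three tokens (of, Canterbury, and the comma) separate the title from the name, so a 0-5 rule matches while a 0-2 rule does not.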
Demonstration 1
Generating candidates
Demonstration 1:
Generating candidates
Purpose:
In this demo, you will learn how to build candidates to tailor your extractors.
User ids / Passwords
OS: biadmin/biadmin
Root: root/dalvm3
Ambari: admin/admin
BigInsights Home: guest/guest-password
4. Now, drag the WatsonDict extractor to the right side of the proximity rule and
combine it.
The Sequence we just created looks for terms with the word Watson following
the list of positive clues that we identified. To accurately capture all possibilities,
we need to also define a sequence that has the word Watson preceding those
same words.
Task 3. Making a copy of an extractor.
1. Right-click on the Sequence 1 extractor and select Copy from the menu.
2. Right-click somewhere on the canvas (outside of the Sequence 1 extractor) and
select Paste as New Copy to create a new sequence.
The Paste as New Copy option makes a separate copy of the original extractor.
Changes that you make to the original extractor will NOT affect the copy.
3. Now we need to change the order of the extractors in Sequence 1 copy 1 to
WatsonDict, 0-5, HighQDict. Drag and drop to rearrange the order.
2. Run this new Union 1 extractor. You should get 178 rows returned.
Note: If you are not getting 178 rows, check that the tags have been removed
from the Documents.
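The effect of a union can be sketched as concatenating the tuples of views that share a schema. The Python below uses hypothetical names and made-up offsets. It does not reproduce the row counts from the demo, and the schema check is one possible way to model the same-schema requirement.

```python
from typing import NamedTuple


class SpanMatch(NamedTuple):
    document: str
    begin: int
    end: int


def union_all(*views):
    """Concatenate views that share one schema; duplicates are kept."""
    combined = []
    schema = None
    for view in views:
        for row in view:
            if schema is None:
                schema = type(row)
            elif type(row) is not schema:
                raise ValueError("all views must share the same schema")
            combined.append(row)
    return combined


# Hypothetical results of the two sequences (clue then Watson, Watson then clue).
clue_then_watson = [SpanMatch("doc1.txt", 10, 42), SpanMatch("doc2.txt", 5, 30)]
watson_then_clue = [SpanMatch("doc1.txt", 36, 70)]
union_view = union_all(clue_then_watson, watson_then_clue)
```

Because nothing is removed at this stage, overlapping or duplicate matches from the two sequences are still present and are dealt with later by filtering and consolidation.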
Task 5. Using regular expressions in extractors - Extra credit.
This is an extra credit task in the sense that it does not continue with the normal
Watson storyline that we have been doing. This task is to show how you can
take advantage of the regular expression extractor.
1. Create a new project. Call it Regular Expression.
2. Add a document. Click the green plus button.
3. Browse for the file located under /home/biadmin/labfiles/TextAnalytics/
4. Add the file Facts.txt
5. Examine the Facts.txt file. Open this file up on your local file system to be able
to see more of the content. The Document pane only shows a subset of the full
document.
6. About 29 lines down in the file, you should see
Geography Afghanistan.
A few lines further down, you should see
Geographic coordinates: 33 00 N, 65 00 E
You are going to create and use a regular expression extractor to extract the
geographic coordinates.
7. Click the New Regular Expression button.
8. Name this extractor, RegexExtractor.
9. In the Extractor Properties, on the regular expression field, enter in this regular
expression that will extract the geographic coordinates.
Geographic coordinates: {1,}((\d{1,2} \d{2} [NS]), (\d{1,3} \d{2} [EW]))
10. Keeping everything else the same, go ahead and run the extractor.
11. You can view the 10 rows that were returned and see where within the file they
are located.
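To see what the expression captures, it can be tried outside the tool. The sketch below runs the same pattern with Python's re module, on the assumption that the tool's regular expression dialect treats this pattern the same way. The two sample lines are the ones quoted from Facts.txt above.

```python
import re

# The pattern from step 9. " {1,}" means one or more spaces after the colon.
COORDS = re.compile(
    r"Geographic coordinates: {1,}((\d{1,2} \d{2} [NS]), (\d{1,3} \d{2} [EW]))"
)

sample = "Geography Afghanistan\nGeographic coordinates: 33 00 N, 65 00 E\n"
match = COORDS.search(sample)

full = match.group(1)       # both coordinates
latitude = match.group(2)   # degrees, minutes, N or S
longitude = match.group(3)  # degrees, minutes, E or W
```

The three capture groups are why the pattern has nested parentheses: the outer group holds the whole coordinate pair and the inner groups hold latitude and longitude separately.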
Results:
You have learned to build candidates to tailor your extractors. Optionally, if
you went through the regular expression section, you should be able to use
the regular expression extractor to extract texts based on the provided
regular expression.
Unit summary
• Understand the general guidelines for developing extractors
• List the AQL candidate generation components
• Explain the use of sequence patterns, proximity rules and union
Unit summary
Unit 5 Filter and consolidation
Unit objectives
• Run an extractor and refine the results
• Eliminate duplicates and overlaps
• Filter the results
• Export the results
Unit objectives
Refine results
• Simplify analysis by manipulating the Output tab of the Extractor
Properties
• Rename a column in the results display
• Add a string column
• Add a transformed output column
Trim
Convert to String
Convert to Lowercase String
New Column from Single Column
New Column from Two Columns
• Hide a column from the results display
• Delete a column from the results display
Only columns manually added can be deleted using this technique
Refine results
Rename a column in the results display
• On the canvas, right-click the extractor that generated the results and click
Edit Output.
• From the column menu, select Rename or simply double click the column.
• Enter the new column name to be displayed in the results.
Add a string column
• On the canvas, right-click the extractor that generated the results and click
Edit Output.
• Click the Manage Columns menu in the left column of the table.
• Click New Column.
Add a transformed output column – for example: converting it to all lowercase
• On the canvas, right-click the extractor that generated the results and click
Edit Output.
• Click the drop-down menu in the header of the column that you want to
transform and select the type of transformation that you want to do.
Hide a column from the results display
Important: If you hide output columns, which are part of the sequence that is being
matched, then when you use that sequence inside another sequence, the output
columns affect the match criteria for the outer sequence. For example, if you create
a sequence called Money, which is a sequence of a literal $, followed by a number,
followed by a dictionary match, and you update the output to hide the $, then if you
use Money inside of a larger sequence, the outer sequence match for Money will not
look for the $. A better approach would be to use a filter to narrow the results to
those preceded by $.
• On the canvas, right-click the extractor that generated the results and click
Edit Output.
• Click the Manage Columns menu in the left column of the table.
• Clear the check boxes for the columns you want to remove from the results
display. These columns are hidden from the results, although the content is
still extracted.
Delete a column from the results display
• On the canvas, right-click the extractor that generated the results and click
Edit Output.
• Click the Manage Columns menu in the left column of the table.
• Click Delete Column and select the check boxes for the columns you want to
remove from the results display.
Example of a filter
• Your Military Ranks extractor might produce a match for the text Chief
Warrant Officer John Doe, but you do not want to include results that
have the word except preceding the match
Create a dictionary with the term except and any other terms you want to
exclude
Open the Output tab
Click New Filter and select Exclude
Select range and occurs after
Select the dictionary
Select the column and between 0 to 2 tokens
Example of a filter
This filter excludes any matches that have the word except within 0-2 tokens
before the match.
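The exclude filter can be sketched as a check on the tokens that precede each match. This Python illustration assumes the same simple tokenizer idea (words and single punctuation marks) and single-word exclusion terms. The function names are hypothetical, and the match offsets are computed from a made-up sentence.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, begin, end) for words and single punctuation marks."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def exclude_if_preceded(text, matches, stop_terms, max_gap=2):
    """Drop (begin, end) matches with a stop term 0..max_gap tokens before them."""
    tokens = tokenize_with_offsets(text)
    stop = {t.lower() for t in stop_terms}
    kept = []
    for begin, end in matches:
        before = [tok for tok, _, tok_end in tokens if tok_end <= begin]
        # The stop term itself plus up to max_gap tokens between it and the match.
        window = before[-(max_gap + 1):]
        if not any(tok.lower() in stop for tok in window):
            kept.append((begin, end))
    return kept


text = ("All officers attended except Chief Warrant Officer John Doe. "
        "Chief Warrant Officer Jane Roe presided.")
names = ["Chief Warrant Officer John Doe", "Chief Warrant Officer Jane Roe"]
matches = [(text.index(n), text.index(n) + len(n)) for n in names]
kept = exclude_if_preceded(text, matches, ["except"])
kept_text = [text[b:e] for b, e in kept]
```

The first match is dropped because except sits directly before it, while the second match survives because no exclusion term appears in the three tokens before it.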
Note: The web tool shows only the first 1000 matches for each extractor, and those
matches are sorted by document name and span offsets. However, the Export
option will export all of the matches found for the specified extractors, and those
matches will not be sorted. The total number of matches which will be exported for
each extractor is displayed in parentheses after the extractor name in the Results
pane.
If you choose to save the files to your local machine, your browser will attempt to
download a generated file named <canvas_name>.zip, which will contain all of the
generated CSV files.
Note: The default name for the zip file is always
<canvas_name>.zip. If you have your browser configured to prompt you for a
location before downloading files, you can change the name of the zip file in the
download prompt. Otherwise, the file will be immediately saved to your default
download location. If a file named <canvas_name>.zip already exists in your default
download location, the behavior is browser-dependent.
If you choose to upload the CSV files to a DFS directory, all of the CSV files will be
written directly to the specified directory, as individual .csv files. (They will not be
packaged into a zip file.) If a generated CSV file has the same name as a file which
already exists in the specified directory, the existing file will be overwritten.
[Figure: extractor development workflow. Sample input documents, Label snippets
(find clues), Develop extractors, Test extractors, Profile extractors, Export extractors]
Demonstration 1
Filtering and consolidating
Demonstration 1:
Filtering and consolidating
Purpose:
In this demonstration, you will filter and consolidate the results of the
extractors
User ids / Passwords
OS: biadmin/biadmin
Root: root/dalvm3
Ambari: admin/admin
BigInsights Home: guest/guest-password
5. Rename the Sequence 1 column. On the Output tab, click the Sequence 1
dropdown and select Rename.
6. Name it WatsonSpan.
7. Click Manage overlapping matches on the WatsonSpan column and select the
Left to Right method.
8. Run the extractor and note that there are now 124 rows returned. More
importantly, the duplicates have been removed from the results.
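One way to picture a left-to-right policy for overlapping matches is sketched below in Python. It assumes that the policy keeps the leftmost, longest span and discards any later span that overlaps it. The function name and the sample spans are hypothetical, and the row counts from the demo are not reproduced.

```python
def consolidate_left_to_right(spans):
    """Keep the leftmost, longest spans and drop duplicates and overlaps."""
    # Sort by start offset; for equal starts, put the longest span first.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    last_end = -1
    for begin, end in ordered:
        if begin >= last_end:  # starts at or after the end of the last kept span
            kept.append((begin, end))
            last_end = end
    return kept


# Hypothetical (begin, end) matches with a duplicate and two overlaps.
raw = [(0, 10), (5, 15), (0, 10), (0, 4), (15, 20)]
consolidated = consolidate_left_to_right(raw)
```

The duplicate (0, 10), the shorter (0, 4), and the overlapping (5, 15) are all removed, which is the kind of reduction in row count described in the step above.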
3. Run the extractor and see that the occurrence has been removed. 123 rows
returned.
4. Now you can drag and drop this extractor from the catalog to use it in other
projects.
2. Click Cancel to the Opening export.zip dialog. We are not going to export.
3. You can export the extractor as a function within BigSheets. The BigSheets
service has to be installed and started for the function to work. If you want to try
this now, go ahead and start up the BigInsights-BigSheets service in Ambari.
4. Once the BigSheets service has started, restart the BigInsights-Home service.
5. Refresh the BigInsights-Home page in the Firefox Browser. You should see
the BigSheets panel enabled (no longer greyed out).
6. Click the Text Analytics link to go back into your projects.
7. Back on the Extractor catalog on the left side, under the guest folder, right-click
the Watson Technology extractor and select Publish to BigSheets.
8. Click Next>.
12. From the Extractors tab guest>Watson Technology right click and select
Run on Cluster.
13. A final way to use your extractor outside of Text Analytics is to run it on the
Cluster. Select the Run on Cluster option.
14. Specify the options that you wish and run the extractor. I leave this as an
optional exercise for you.
Results:
You have learned to filter and consolidate the results of the extractors.
Unit summary
• Run an extractor and refine the results
• Eliminate duplicates and overlaps
• Filter the results
• Export the results
Unit summary
Unit 6 Working with pre-built extractors
Unit objectives
• List the categories of the pre-built extractors
• Export extractors
• Describe tokenization and multilingual support
Unit objectives
Pre-built extractors
• Common extractors for various domains
• Extract specific information from input text
• Located on the Extractor catalog
• Five general categories
Named entity extractors
Financial extractors
Generic extractors
Sentiment extractors
Other extractors
Pre-built extractors
BigInsights Text Analytics comes with a catalog of pre-built extractors that are ready
to use. All you do is drag and drop them from the Extractor Catalog onto the
canvas and specify the Extractor properties as needed. Then they are ready to go.
These are some of the common extractors that are created for various domains and
you can use them without having to worry too much about how they are created.
There are five general categories of pre-built extractors.
• Named entity extractors
• Financial extractors
• Generic extractors
• Sentiment extractors
• Other extractors
The following slides give more details on these extractor categories.
Location Country
Address Continent
Financial extractors
• Used with finance-related data
Finance reports
Earnings reports
Analyst estimates
• Coverage for English input documents
Acquisition CompanyEarningsGuidance
Alliance JointVenture
AnalystEarningsEstimate Merger
CompanyEarningsAnnouncement
Financial extractors
Financial extractors, as you can guess, are used for finance-related data, including
finance reports, earnings reports, and analyst estimates. Coverage is available only for
English input documents. These extractors are:
• Acquisition
• Alliance
• AnalystEarningsEstimate
• CompanyEarningsAnnouncement
• CompanyEarningsGuidance
• JointVenture
• Merger
You can find more details on these extractors in the IBM Knowledge Center under
BigInsights v4.1
Generic extractors
• Used for extracting:
Generic text
Numeric information
Examples: capital words, integers
• Coverage for English input documents
CurrencyAmount Integer
Decimal Merger
FileName Number
Generic extractors
Generic extractors are used to extract text and numeric information such as capital
words or integers. Again, there is only coverage for English input documents here.
• CapsWords
• CurrencyAmount
• Decimal
• FileName
• FileNameExtension
• Integer
• Merger
• Number
• Percentage
Sentiment extractors
• Used to extract sentiment information from:
Surveys
Domain-independent content
• Coverage for English input documents
Sentiment_Survey
Sentiment_General
Sentiment extractors
Sentiment extractors are used to extract sentiment from surveys or domain-
independent content. Only coverage for English input documents is available.
Other extractors
• Used to extract domain independent information:
Dates
URLs
Emails
• Coverage for English input documents
DateTime URL
EmailAddress
NotesEmailAddress
PhoneNumber
Other extractors
There are other extractors that you can use to extract domain-independent
information such as dates, URLs, and emails. Coverage is available only for English
input documents.
Exporting extractors
• Export extractor for use by external applications
Exporting to AQL
Exporting as MapReduce jobs
Exporting as BigSheets functions
Exporting extractors
AQL
When you execute extractors in the web tool, they are transformed into Annotation
Query Language (AQL) statements which are in turn then compiled and executed
against your sample documents. If you want to view, edit, or execute the generated
AQL outside of the web tool, you can use the Export AQL... option to obtain the AQL
for your extractors.
MapReduce jobs
When the web tool is running in an environment that has access to a distributed file
system (DFS), you can use the Run on Cluster option to export your extractors as
map/reduce jobs.
BigSheets functions
When the web tool is running in an environment that has access to the BigSheets
value-add service, you can use the Publish to BigSheets option to export your
extractors as BigSheets functions. Published BigSheets functions can be executed
from within the BigSheets web application, just like any other BigSheets function.
Demonstration 1
Analyzing quarterly reports using Text Analytics
Demonstration 1:
Analyzing quarterly reports using Text Analytics
Purpose:
In this demo, you will work with some of the prebuilt extractors to analyze
quarterly reports.
User ids / Passwords
OS: biadmin/biadmin
Root: root/dalvm3
Ambari: admin/admin
BigInsights Home: guest/guest-password
7. On the Extractor Properties under the Settings tab, select the Additional
Location Names and add the E/ME/A acronym to that list.
8. Run the extractor again and you see that there are now 64 matches.
Task 2. Analyzing documents and identifying examples.
For this process, you would typically enlist the help of someone who knows the
document well to help you identify the examples of clues to search. Since you
are interested in extracting revenue by division, you must read through to find
spans of text that contain this information. Look for patterns and clues in the text
to help improve the accuracy of the extractor.
An example that you might find is a phrase such as Revenues from Software
were $3.9 billion. This has three important features:
"Software" is a division name
"$3.9 billion" is a revenue amount
"Revenues" is the term that signals a revenue statement
You will use these features as context to identify instances of revenue by
division.
It is a good idea to decompose the clues to the lowest level. This allows for
flexibility and lets the extractor perform all the hard work of combining all
the clues. Consider that Money has three basic features, a currency sign,
followed by a number, followed by a quantifier such as million or billion.
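The decomposition of Money into three basic features can be sketched as three small patterns that are then combined. This is a Python illustration with hypothetical names, not the extractor that the tool builds. The dollar sign and the two quantifiers are the only values assumed.

```python
import re

# Three basic features, each kept as small as possible.
CURRENCY = r"\$"
NUMBER = r"\d+(?:\.\d+)?"
QUANTIFIER = r"(?:million|billion)"

# The Money sequence: currency sign, then number, then quantifier.
MONEY = re.compile(rf"{CURRENCY}{NUMBER}\s{QUANTIFIER}")

text = "Revenues from Software were $3.9 billion, up from $3.6 billion."
amounts = [m.group(0) for m in MONEY.finditer(text)]
```

Keeping the three parts separate means that a new quantifier, for example, can be added in one place without touching the other two features.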
Two patterns that you may have picked up are:
Revenues for division were $x.x
Division revenues were $x.x
11. Drag the Quantifier extractor into the Money extractor to complete the
sequence.
12. Run it and you should see 333 matches returned. You'll also see each tab for
the rows specific to the individual extractors.
You have now located all the instances of money. The next task is to
combine them with revenues and divisions.
Task 4. Writing and testing extractors for candidates.
1. Create a new dictionary. Name it Revenue.
2. Add two terms to it: revenue and revenues.
3. Run this extractor to test it out. There should be 238 matches.
4. Create a new dictionary for Division names.
5. Add the following terms to it: Global Technology Services, Systems and
Technology, S&TG, Software, Global Financing, and Global Business
Services.
6. Run the extractor. You should get 142 rows.
7. Notice that in the results, the terms software and global financing are picked up
as division names. Because they are in lowercase, they are likely not division
names. The problem can be fixed by choosing the Match Case option for the
extractor.
8. Run the extractor again and you should get 98 rows now.
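The effect of the Match Case option can be illustrated with a small dictionary matcher. This is a minimal sketch in Python, assuming a dictionary is just a list of terms matched on word boundaries; the function name dictionary_matches and the sample sentence are made up for the example.

```python
import re

# Division names taken from the dictionary created in step 5.
DIVISIONS = [
    "Global Technology Services", "Systems and Technology", "S&TG",
    "Software", "Global Financing", "Global Business Services",
]


def dictionary_matches(text, terms, match_case=False):
    """Return dictionary terms found in text, optionally case-sensitive."""
    flags = 0 if match_case else re.IGNORECASE
    # Try longer terms first so a long term wins over a shorter one.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", flags)
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical sentence, not taken from the quarterly reports.
sample = ("Software revenues rose as clients bought more software "
          "and global financing grew.")
print(dictionary_matches(sample, DIVISIONS))                   # ignores case
print(dictionary_matches(sample, DIVISIONS, match_case=True))  # Match Case
```

Without Match Case the lowercase mentions are returned as well; with it, only the capitalized "Software" remains, which mirrors the drop in row count seen in the tool.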
You have now extracted the three key basic features: money, revenue, and
division. The next step is to extract candidates that match the two patterns that
you identified earlier. To generate candidates, you combine extractors into
sequences, building on the extractors that you created in the previous tasks.
Recall that the two patterns are "Revenues for division were $x.x" and
"Division revenues were $x.x".
The first pattern looks for examples where the word revenue is followed by a
division name and then a money amount, with some number of tokens in
between. This is the conceptual view of the first pattern.
<Revenue><1 to 2 Tokens><Division><1 to 20 Tokens><Money>
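The conceptual pattern can be approximated with a single regular expression, shown here as a rough Python sketch. It assumes that a token is a whitespace-delimited word, which is simpler than the tokenizer used by the product, and the names REVENUE, DIVISION, MONEY, TOKEN, and candidates are illustrative only.

```python
import re

# Illustrative stand-ins for the Revenue, Division, and Money extractors.
REVENUE = r"(?i:revenues?)"  # dictionary terms, matched case-insensitively
DIVISION = (r"(?:Global Technology Services|Global Business Services|"
            r"Systems and Technology|Global Financing|Software|S&TG)")
MONEY = r"\$\d+(?:\.\d+)?\s(?:million|billion)"
TOKEN = r"\S+\s+"  # assumption: one token = one whitespace-delimited word

# <Revenue><1 to 2 Tokens><Division><1 to 20 Tokens><Money>
REVENUE_DIVISION = re.compile(
    rf"(?P<revenue>{REVENUE})\s+"
    rf"(?:{TOKEN}){{1,2}}"
    rf"(?P<division>{DIVISION})\s+"
    rf"(?:{TOKEN}){{1,20}}?"
    rf"(?P<money>{MONEY})"
)


def candidates(text):
    """Return one dict per candidate with revenue, division, and money."""
    return [m.groupdict() for m in REVENUE_DIVISION.finditer(text)]


print(candidates("Revenues from Software were $3.9 billion."))
# Four tokens separate the revenue word from the division: no candidate.
print(candidates("Revenues in the quarter from Software were $3.9 billion."))
```

The second call returns nothing because the gap between the revenue word and the division name exceeds two tokens, which is the role the proximity rule plays in the sequence.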
9. Add a new proximity rule with 1-2 tokens.
10. Drag the proximity rule to the right of the Revenue extractor to create a new
sequence.
11. Drag the Division extractor to the right of the proximity rule to add to the
sequence.
12. In order to create the next proximity rule, right-click on the Division extractor
and choose Add After and then Proximity Rule. Fill in 1-20 in the text box.
You should now have linked copies of the three extractors. Remember that a
change to one linked copy affects all of them. If you need an independent
copy, select Paste as New Copy instead. Notice that the linked copies are the
same color.
20. Create a new sequence with these three extractors and proximity rules to create
an extractor for the second pattern. Name it Division Revenue.
22. Union these two extractors together to yield a full picture. First, the columns
must match. Modify the output specification to make them match.
a) Go to the Output tab of the Properties pane.
b) Deselect the Revenue column. We don't need this column. Click on the
dropdown next to the green plus.
c) Rename the Division Revenue column to match.
d) Rename the Money column to Amount.
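The reason the columns must match before a union can be sketched as follows. This is a minimal Python illustration, assuming each extractor's output is a list of rows keyed by column name; the helper names project and union, the source column names, and the second sample row are hypothetical.

```python
def project(rows, mapping):
    """Keep only the columns named in mapping, renaming old -> new."""
    return [{new: row[old] for old, new in mapping.items()} for row in rows]


def union(*tables):
    """Concatenate tables, refusing to combine mismatched column sets."""
    schemas = {tuple(sorted(table[0])) for table in tables if table}
    if len(schemas) > 1:
        raise ValueError(f"columns do not match: {sorted(schemas)}")
    return [row for table in tables for row in table]


# Hypothetical outputs of the two sequence extractors.
revenue_division = [
    {"Revenue": "Revenues", "Division": "Software", "Money": "$3.9 billion"},
]
division_revenue = [
    {"Division Revenue": "Global Financing", "Money": "$620 million"},
]

# Drop the Revenue column and rename the rest to a shared schema.
left = project(revenue_division, {"Division": "Division", "Money": "Amount"})
right = project(division_revenue,
                {"Division Revenue": "Division", "Money": "Amount"})
print(union(left, right))
```

Calling union on the unmodified tables raises an error, which corresponds to the tool requiring matching output columns before the two extractors can be combined.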
3. Choose the output column match and the method Not Contained Within. This
specifies that we want only matches that are not contained within another
match.
4. Run the extractor again and verify that none of the 49 matches are contained
within another.
5. Look at the results again. Notice that there are two values for the Software
division in the 4Q2006.txt file. Looking more closely, one of these results was for
4Q, and the other for the full year.
6. On examining the document, we see that the unwanted results have their
Money amount within a proximity of 1200 tokens from a phrase like Full-Year
2006 Results. To match multiple years, we can create a regular expression
to match this clue for unwanted results. Click the New Regular Expression
button.
7. Type in FullYear as the name.
8. Type in Full-Year \d{4} Results as the regular expression.
9. Run the regular expression to test it. You should see five results, one per
document.
10. Select the Divisions and Revenues extractor. Under the Output tab, click the
New Filter button.
11. Click Exclude, because we want to exclude some rows.
12. Choose the Amount column, the range type, and the occurs after option.
13. Select the FullYear extractor and the FullYear column, choose between, and
fill in 1 and 1200 for the tokens.
14. Run the extractor and verify that only 25 results are shown now.
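The two refinements in this task, keeping only matches that are not contained within another match and excluding amounts that occur shortly after a full-year heading, can be sketched in Python. The sketch assumes spans are expressed as token offsets with an exclusive end, and that the distance is the number of tokens between the end of the clue and the start of the amount; Span, not_contained_within, and exclude_after are names invented for this example.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # offset of the first token
    end: int    # offset just past the last token (exclusive)


def not_contained_within(spans):
    """Keep only spans that are not contained within a different span."""
    unique = sorted(set(spans))
    return [s for s in unique
            if not any(o != s and o.begin <= s.begin and s.end <= o.end
                       for o in unique)]


# The regular expression from step 8.
FULL_YEAR = re.compile(r"Full-Year \d{4} Results")


def exclude_after(amounts, clues, min_gap=1, max_gap=1200):
    """Drop amounts that start min_gap..max_gap tokens after any clue."""
    return [a for a in amounts
            if not any(min_gap <= a.begin - c.end <= max_gap for c in clues)]


# Hypothetical token offsets, not taken from the quarterly reports.
matches = [Span(0, 6), Span(2, 6), Span(10, 14)]
print(not_contained_within(matches))

print(FULL_YEAR.findall("IBM Full-Year 2006 Results and Full-Year Results"))

amounts = [Span(50, 53), Span(2000, 2003)]
clues = [Span(1500, 1503)]
print(exclude_after(amounts, clues))
```

In this sketch the amount at offset 2000 is dropped because it starts 497 tokens after the clue, while the amount at offset 50 precedes the clue and is kept.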
This view contains exactly the information that you need for further analysis.
When you apply text analytics to more complex documents, and when you are
extracting more sophisticated information, you would expect to spend time
improving the precision and recall of your extractor. You can also profile your
extractor to understand and improve its performance characteristics.
Task 6. Finalizing and saving the extractor.
1. Click on the extractor and click the Save icon.
2. Select the guest category to save the extractor.
3. You can choose to Export AQL, Publish to BigSheets or Run on the Cluster.
Results:
In this demonstration you have learned to use some of the pre-built extractors
to analyze IBM quarterly reports to figure out the revenues from each division.
Unit summary
• List the categories of the pre-built extractors
• Export extractors
• Describe tokenization and multilingual support