9.OSINT Documentation
1.1 Quick Start Guide
1.2 Tutorials
1.2.1 Adding a Custom Entity Type
1.2.2 Creating effective search queries
1.2.3 Editing the current Name Variant Database
1.2.4 Importing a Name Variant File
1.2.5 Using the Duplicate Bookmark Detection
1.3 Tasks
1.3.1 Crawling a Web Site
1.3.2 Creating a Case Project
1.3.3 Creating a Category Definition File
1.3.4 Creating a Configuration Project
1.3.5 Creating a Custom Report
1.3.6 Generating a Report
1.3.7 Defining a Category
1.3.8 Defining the maximum number of search result links
1.3.9 Downloading of Bookmarks
1.3.10 Encoding a File in UTF-8
1.3.11 Filtering Search Engine Result Bookmarks
1.3.12 Importing Bookmarks from web browsers
1.3.13 Importing Documents from Local Disk
1.3.14 Installing on Linux (Ubuntu)
1.3.15 Installing on Windows
1.3.16 Performing an Internet Search
1.3.17 Performing the Entity Extraction
1.3.18 Requesting Support
1.3.19 Running on Linux
1.3.20 Running on Windows
1.3.21 Setting HTTP Proxy Information
1.3.22 Using Local Search
1.3.23 Using the Category Browser view
1.3.24 Using the Entity Browser view
1.3.25 Using the Graph view
1.3.26 Installing on Mac OS X using VirtualBox
1.4 Frequently Asked Questions
1.4.1 How can I license the EMM OSINT Suite?
1.4.2 How can I recover data from a corrupt workspace?
1.4.3 Result Link Extraction works, but the download of the resulting Bookmarks fails
1.4.4 Shall I download the 32-bit or 64-bit version?
1.4.5 The application startup fails due to "Companion shared library not found"
1.4.6 What are the System Requirements for OSINT Suite?
1.4.7 What is the EMM OSINT Suite?
1.5 Glossary
1.5.1 Boolean Logic
1.5.2 Intelligence Cycle
1.5.3 Internet Protocols
1.5.4 Internet Search Engine
1.5.5 Open Source Intelligence (OSINT)
1.6 Concepts
1.6.1 Category Matching
1.6.2 Entity Extraction
1.6.2.1 Name Variant Matching
1.6.2.2 Geo Matching
1.6.2.3 Regular Expression Matching
1.6.2.4 Entity Guessing
1.6.2.5 Entity Normalisation
1.6.3 Regular Expressions
1.6.4 Reporting
1.6.4.1 Data Objects
1.6.5 Text Extraction
1.7 Training
1.7.1 1. Installing EMM-OSINT Suite
1.7.1.1 Linux (Ubuntu)
1.7.1.2 Microsoft Windows
1.7.1.3 MacOSX
1.7.2 2. Getting Started
1.7.3 3. Module Sessions
1.7.3.1 C1 - Data Acquisition
1.7.3.1.1 C1 - Lab Exercises
1.7.3.2 C2 - Entity Extraction
1.7.3.2.1 C2 - Lab Exercises
1.7.3.3 C3 - Entity Extraction Advanced
1.7.3.3.1 C3 - Lab Exercises
1.7.3.4 C4 - Reporting & Data Export
1.7.3.4.1 C4 - Lab Exercises
1.7.3.5 C5 - Category Matching
Quick Start Guide
Introduction
The EMM OSINT Suite is a software package that helps you search the Internet or a collection of files on disk. You can download search results
from web search engines and then find relevant documents in the set of downloaded documents. In addition, you can import documents from
local disk for analysis. The software can process the most popular text and binary file formats, such as HTML, PDF, MS Office and others.
The core of the software is the entity extraction module, which matches text locations against predefined patterns for different types of entities,
such as person, organisation and place names, credit card numbers, VAT identifiers, URLs, etc. User-defined patterns can be added to find
investigation-specific entity types, such as number plates or tax identifiers.
The Entity Browser view shows you the extracted entity information and allows you to find documents where a specific entity was found. (In
the beginning this view is empty; we first need to add some documents and run the entity extraction.)
The editor area in the centre of the application window hosts multiple editors at the same time (after startup it is empty). It is used to edit files
of the workspace or to interact with the web using a browser view.
The different views in the bottom view area and in the right view area show information related to files and objects selected in other views or
editor windows (Properties view) or show status information (Console view) or the progress of background operations (Progress View).
You can close views and open them again later from the main menu: select Window > Show View and then the desired view.
The EMM-OSINT Suite contains a function to create reports of the extracted data. A report can be used to export data from the software and
open it in another program. The system takes a report template (a few basic ones are already included), enriches it with
the extraction data and produces an output file, such as an HTML file. There are two ways to create a report of the data:
1. Generating a Report
2. Creating a Custom Report
Tutorials
Adding a Custom Entity Type
The system lets you add custom entity types to the predefined types that ship with the software (e.g.
person, organization, location). A custom entity type extracts data that is not covered by the predefined types in the
system.
This tutorial describes how to add a custom entity type for Swedish number plates.
Introduction
Swedish vehicle registration plates are used for most types of vehicles and consist of three letters followed by three digits.
The combination is simply a serial and has no connection with a geographic location, although the last digit shows in which month the car
has to undergo vehicle inspection. Vehicles like police cars, fire trucks, public buses and trolley buses use the same type of plate as normal
private cars and cannot be distinguished by the plate alone. Military vehicles have special plates.
The only coding that can be read from the plate alone is when the vehicle must undergo inspection. The last digit of the plate
denotes this:
Last digit   Inspection month   Inspection period
1            January            November-March
2            February           December-April
3            March              January-May
4            April              February-June
5            July               May-September
6            August             June-October
7            September          July-November
8            October            August-December
9            November           September-January
0            December           October-February
All letters in the Swedish alphabet are used, except the letters I, Q, V, Å, Ä and Ö. 91[1] letter combinations are not used since they may be
politically offensive or otherwise unsuitable.
To add a custom entity type to the system to recognize Swedish number plates, perform the following steps:
Inside the Active Entities folder, rename the newly copied file:
Right click the number-plates.xml file, then click on Open With > Text Editor.
The file opens in a text editor in the editor area. It contains many comments that explain how to fill in the different tags.
"id" is mandatory; it is a code of at most two characters and must be unique in the OSINT Suite namespace. The letters
p, o, u, t are already used for the internal types. We suggest choosing a two-character code which is not yet used in any
other custom entity definition XML file.
"description" is mandatory; it is free text that describes the data type. This description is shown in the user interface to
denote the entity type.
In the text editor navigate to the <expressions> tag and add a new <expression></expression> child tag to hold the pattern definition as
follows:
<expression>
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
<regex> is mandatory; its CDATA section defines the regular expression pattern. When mode is not set, or when mode="basic" is set,
the pattern follows the syntax of the dk.brics.automaton library [http://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html].
<description> is optional; it is free text that describes the pattern.
Within an <expression> tag, zero or more <output> tags can be specified (in practice at least one); each <output> adds a piece of
information as meta data to the tag which defines the found entity in the meta data of the file.
There must be an <output> tag which identifies the data type of the term that matched the pattern; in the above
example this is <output key="type" value="pn"/>, which tells the system that any term matching that pattern is of data type
"pn" (plate number).
Pattern Syntax
By default the system uses the syntax of the BRICS library (see http://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html),
which is less expressive than the standard java.util.regex package. If you want to use the full java.util.regex syntax,
set the mode attribute to "groups": <regex mode="groups">
This pattern will match three uppercase letters, optionally followed by a space or a dash, followed by a number between 000 and 999. This is
a list of example terms that would match the above pattern:
WNF766
WNF 766
WNF-766
As can be seen, the pattern (regular expression) used to recognize Swedish number plates, defined within the <regex>
tag, is
[A-ZÅÄÖ]{3}[ \-]?<000-999>
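The behaviour of this pattern can be tried out with a short sketch outside the suite. Python's re module has no counterpart for the BRICS numeric interval <000-999>, so the sketch assumes that [0-9]{3} (any three digits) is an equivalent spelling; the name PLATE is illustrative only and not part of the OSINT Suite.

```python
import re

# Assumed translation of [A-ZÅÄÖ]{3}[ \-]?<000-999> into Python's re syntax:
# the BRICS interval <000-999> is written here as exactly three digits.
PLATE = re.compile(r"[A-ZÅÄÖ]{3}[ \-]?[0-9]{3}")

# Print whether each candidate term matches the whole pattern.
for term in ["WNF766", "WNF 766", "WNF-766", "WN766", "wnf766"]:
    print(term, bool(PLATE.fullmatch(term)))
```

The first three terms match; the last two do not, because the pattern requires exactly three uppercase letters.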
Pattern Meaning
In our above example, let's say we have a document which contains some text interspersed with some number plate terms as follows:
If the Entity Extraction process analyzes this text, it will produce the following tags to be included in the meta data of the file:
In other words, the system thinks it has found three different number plates (see the different id values), even though they are only spelled
slightly differently and describe the same number plate. In order to overcome this problem we need to output the matched terms in a
standardised form.
Continuing the Swedish number plate example, the following techniques are explained:
Standardizing the output name of the entity. This is useful when OSINT finds entities that are not
spelled in exactly the same way but refer to the same entity.
Defining capturing groups in the pattern. The default regular expression library in OSINT does not support capturing groups.
To work around this, set the "mode" attribute on the <regex> tag.
Using keywords as patterns. A text file containing the keywords defines the exact terms to recognize.
Using a script to generate the patterns on the fly. OSINT allows you to define your own Java-style scripts to
recognize entities.
The output key called "name" of the entity recognized by OSINT can be standardized. Returning to the example of the regular expression
(pattern) defined to recognize Swedish number plates:
<expression>
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
the output in OSINT for a text including "... WNF766 ... WNF 766 ... WNF-766 ..." would be:
Notice that there are three different entities with different "name" output keys. To standardize this key so that OSINT shows the
same entity (WNF766, for instance) in all three cases, add the "name" output key to the definition of the pattern as follows:
<expression>
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
<output key="name"><![CDATA[
name = term.replaceAll("[ \\-]", "");
return name;]]>
</output>
</expression>
If the Entity Extraction module processes the text again, it produces the following meta tags:
As shown, the output for the "name" key is now unified for all the entities.
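The effect of the "name" output script can be reproduced in a few lines. This is a sketch in Python rather than the suite's own scripting language; re.sub is used as the counterpart of the replaceAll call, and the function name normalise_name is hypothetical.

```python
import re

def normalise_name(term):
    # Counterpart of term.replaceAll("[ \\-]", ""): drop spaces and dashes.
    return re.sub(r"[ \-]", "", term)

# The three spellings from the example text.
terms = ["WNF766", "WNF 766", "WNF-766"]

# Collecting the normalised names in a set shows that only one name remains.
names = {normalise_name(t) for t in terms}
print(names)
```

Because all three spellings collapse to the same string, they can be reported as a single entity.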
By default, the regular expression library used in OSINT does not support capturing groups within the definition of the pattern. To work around
this, set the "mode" attribute on the <regex> tag as follows:
<expression>
<regex mode="groups"><![CDATA[([A-ZÅÄÖ]{3})[ \-]?(<000-999>)]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
Notice how the "mode" attribute is added to the <regex> tag and set to "groups" in order to allow the definition of groups (using parentheses) in
the regular expression.
Then, we can use the defined groups for further processing, referring to them as groups[1], groups[2], etc.
In the example above, if OSINT finds the entity "WNF766" in the text, the variable "groups[1]" refers to the string matched by the first
group defined in the regular expression (i.e. "WNF"), while "groups[2]" refers to the string matched by the second group ("766").
These groups can be used within a script, such as one that generates patterns on the fly, as explained below.
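How the two groups split a matched term can be illustrated with a small sketch. It uses Python's re module, where m.group(1) and m.group(2) play the role of groups[1] and groups[2]; the digits are written as [0-9]{3} (an assumption, since the BRICS interval syntax is not available there), and split_plate is a hypothetical helper name.

```python
import re

# Two capturing groups: the letter part and the digit part of the plate.
PLATE_GROUPS = re.compile(r"([A-ZÅÄÖ]{3})[ \-]?([0-9]{3})")

def split_plate(term):
    """Return (letters, digits) for a matching term, or None if it does not match."""
    m = PLATE_GROUPS.fullmatch(term)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_plate("WNF766"))
print(split_plate("WNF-766"))
```

Both calls yield the same pair of letters and digits, which is what makes the groups useful for building a standardised name.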
OSINT can use keywords as patterns by reading them from an external text file. To use this feature, set the "mode" attribute
of the <regex> tag to "file" as follows:
<expression>
<regex mode="file"><![CDATA[keywords.txt]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
The CDATA section must contain the path to a file of keyword terms, relative to the location from which the XML file that defines the custom
entity is loaded. The keyword file must be a text file in UTF-8 format. Each line that is neither empty nor a comment
is treated as one keyword term; it is case sensitive and is matched as-is
(there is no need to escape special characters). An example of the "keywords.txt" file would be:
#comment lines begin with # and are ignored
#empty lines (as the one below) are also ignored
#if possible, the first and the last line of the file
#should be either an empty line or a comment line
#keywords start here
WNF766
wnf766
WNF-766
wnf-766
Given the keywords file above and the example text "... WNF766 ... WNF 766 ... WNF-766 ...", OSINT would
produce two XML elements:
Whitespace is allowed inside the keywords defined in the keywords file.
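The keyword file rules described above (comment lines, empty lines, case-sensitive literal matching) can be sketched as follows. This is an illustration under stated assumptions, not the suite's implementation: load_keywords and find_keywords are hypothetical names, and the file content is passed in as a list of lines instead of being read from disk.

```python
import re

SAMPLE_FILE = [
    "#comment lines begin with # and are ignored",
    "",
    "#keywords start here",
    "WNF766",
    "wnf766",
    "WNF-766",
    "wnf-766",
]

def load_keywords(lines):
    """Keep every line that is neither empty nor a comment, unchanged."""
    keywords = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        keywords.append(line)
    return keywords

def find_keywords(text, keywords):
    """Return all literal, case-sensitive keyword occurrences in text."""
    if not keywords:
        return []
    # Try longer keywords first so that a longer keyword wins over a shorter one.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return pattern.findall(text)

keywords = load_keywords(SAMPLE_FILE)
print(find_keywords("... WNF766 ... WNF 766 ... WNF-766 ...", keywords))
```

Only two of the three spellings in the text are found, because "WNF 766" is not listed as a keyword; re.escape stands in for the fact that keywords need no manual escaping.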
OSINT also allows you to define your own scripts, written in a Java style, to recognize entities. To use this feature, set the "mode" attribute
of the <regex> tag to "script" as follows:
<expression>
<regex mode="script"><![CDATA[
List<String> ls = new ArrayList<String>();
ls.add("WNF[ \\-]?[0-9]{3}");
ls.add("UCD[ \\-]?[0-9]{3}");
return ls;
]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
As can be seen, the script (in the <regex> CDATA section) is written in the Java language. The "script" mode has the following properties:
The script can reference the path from which the XML definition file is loaded as "resourcespath" ("resourcespath" is a String).
The script must return a List<String>, where each element in the list is a regular expression pattern.
In the example above, the script recognizes entities such as "WNF777", "WNF-876", "WNF 987", "UCD465", "UCD-999" or "UCD 112".
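The idea of generating patterns on the fly can be checked with a short sketch. It is written in Python rather than in the suite's script language; build_patterns and recognises are hypothetical names, and the list of prefixes simply mirrors the two ls.add calls above.

```python
import re

def build_patterns(prefixes):
    # One pattern per prefix: the prefix, an optional space or dash, three digits.
    return [re.escape(p) + r"[ \-]?[0-9]{3}" for p in prefixes]

# Compile the generated pattern strings once.
PATTERNS = [re.compile(p) for p in build_patterns(["WNF", "UCD"])]

def recognises(term):
    """True if any generated pattern matches the whole term."""
    return any(p.fullmatch(term) for p in PATTERNS)

for term in ["WNF777", "WNF-876", "WNF 987", "UCD465", "UCD-999", "UCD 112", "ABC123"]:
    print(term, recognises(term))
```

Generating the list from data keeps the XML definition short when many prefixes share the same structure.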
Creating effective search queries
When you do a Google search you are not searching the entire web; you are searching the Google index. Google uses software
programs called web spiders that go through web pages, following the links and storing the information across the hundreds of thousands of
machines it operates. Currently, many billions of pages are stored by Google. When we type a query and hit return, the Google software
searches its large index to find every page that includes those terms. The index which is queried may be different.
But how does Google decide which few documents I really want? By taking into account over 200 parameters for each document in its
index:
How many times does this page contain your keywords?
Do the words appear in the title of the page? In the URL?
Does the page include synonyms for those words?
Is this page from a quality website or from a spam one?
What is its PageRank? (A formula invented by Google that measures the importance of a web page by looking at how many outside
links point to it and how important those links are.)
etc.
Google combines all these factors to produce the overall score of each page and returns the search results.
Exercise C1-2
Search for drug trafficking on The Guardian newspaper website
Search for PDF documents containing the exact phrase "Al Qaeda"
Search for documents about terrorist attacks or plans but not containing words such as "United States" or "Iraq"
Related information:
http://www.powersearchingwithgoogle.com
Power Searching with Google Quick Reference
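Queries of the kind asked for in Exercise C1-2 are usually composed from a handful of search operators: quotation marks for an exact phrase, site: to restrict results to one website, filetype: to restrict the document format, OR for alternatives and a leading minus sign to exclude a term. The sketch below assembles such query strings; build_query is a hypothetical helper, the domain theguardian.com is an assumption, and the resulting queries are only one possible way to approach the exercise.

```python
def quote(term):
    # Multi-word terms must be quoted to be treated as one phrase.
    return f'"{term}"' if " " in term else term

def build_query(words=(), phrase=None, any_of=(), exclude=(), site=None, filetype=None):
    """Assemble a search query string from common search operators."""
    parts = [quote(w) for w in words]
    if phrase:
        parts.append(f'"{phrase}"')
    if any_of:
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    parts.extend("-" + quote(t) for t in exclude)
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

print(build_query(phrase="drug trafficking", site="theguardian.com"))
print(build_query(phrase="Al Qaeda", filetype="pdf"))
print(build_query(words=["terrorist"], any_of=["attacks", "plans"],
                  exclude=["United States", "Iraq"]))
```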
Editing the current Name Variant Database
The suite allows you to edit its Name Variant Database in order to add new entities or modify existing ones, keeping the keys currently assigned
to each entity.
KEY. The import process ignores the values in this column because it assigns its own primary key to
each new entity. However, this column must appear as the first column of the TSV file, even though its values are discarded.
PID (Profile Identification). The identification value of an entity. This numerical value is very important: all variants (different
names for the same entity) that belong to the same entity must have the same value. The first occurrence found in the TSV file is
considered the canonical entity (the original name form of the entity) and the following ones are variants of this canonical form (see
the example below).
TYPE. The type of the entity. OSINT accepts four main entity types:
o, for organizations
p, for persons
t, for toponyms (locations)
u, for unknown types of entities
VARIANT. The form or name of the entity, written exactly as you want the process to match it in the documents.
The following excerpt from the Name Variant Database in OSINT serves as an example:
2 11 p Aaron Albert
3 11 p A. Albert
4 11 p A. M. Albert
5 21 o Chad Calvin Christian
6 21 o CCC
7 21 o C.C. Christian
8 41 t Milano
9 61 u Harold Hugh
10 61 u Henry Hugh
In this example, the entity Aaron Albert (person) has a PID value of 11. The first occurrence is the canonical
(original) form for that entity, whereas the following ones with the same PID (A. Albert, A. M. Albert) are considered variants of that
canonical form. All these occurrences represent the same real-life entity (the person Aaron Albert). Another example in
the table is the organization called Chad Calvin Christian (PID 21): there is one canonical form and two variants (CCC,
C.C. Christian) for this entity. Finally, we find the entity Milano (PID 41) with only one variant (the canonical form) and one entity of unknown
type (Harold Hugh) with two variants.
Note that the file must be a UTF-8 encoded flat file in TSV (tab-separated values) format.
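The relationship between PID, canonical form and variants can be made concrete with a small sketch that groups rows by PID. It is an illustration only: group_by_pid is a hypothetical helper, the sample rows are taken from the excerpt above, and the sketch lists the canonical form separately from the further variants, whereas the text counts the canonical form as a variant too.

```python
# Tab-separated sample rows: KEY, PID, TYPE, VARIANT.
SAMPLE = (
    "2\t11\tp\tAaron Albert\n"
    "3\t11\tp\tA. Albert\n"
    "4\t11\tp\tA. M. Albert\n"
    "8\t41\tt\tMilano\n"
    "9\t61\tu\tHarold Hugh\n"
    "10\t61\tu\tHenry Hugh\n"
)

def group_by_pid(tsv_text):
    """Group rows by PID; the first row seen for a PID is the canonical form."""
    entities = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        _key, pid, etype, variant = line.split("\t")
        if pid not in entities:
            entities[pid] = {"type": etype, "canonical": variant, "variants": []}
        else:
            entities[pid]["variants"].append(variant)
    return entities

for pid, entity in group_by_pid(SAMPLE).items():
    print(pid, entity["type"], entity["canonical"], entity["variants"])
```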
The procedure of editing the current Name Variant Database consists of three steps:
The exporting process will finish when this progress bar disappears.
Opening the database file and adding new entities or modifying existing ones
The generated export file is a flat file composed of four columns separated by the tab character (see above).
Any text editor can be used to modify the database file (TextPad for Windows, WordPad, ...).
Once the entities have been added or modified as explained above, the database file must be saved in UTF-8 format and must always
keep the four-column format.
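Before re-importing an edited file, it can help to check that every row still has four tab-separated columns. The sketch below performs such a check; the rules it applies (numeric PID, TYPE being one of o, p, t, u, non-empty VARIANT) are assumptions derived from the column description, validate_rows is a hypothetical name, and this is not the validation the suite itself performs.

```python
VALID_TYPES = {"o", "p", "t", "u"}

def validate_rows(tsv_text):
    """Return a list of (line number, problem) pairs; an empty list means no problem was found."""
    problems = []
    for number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) != 4:
            problems.append((number, "expected 4 tab-separated columns"))
            continue
        _key, pid, etype, variant = columns
        if not pid.isdigit():
            problems.append((number, "PID is not numeric"))
        if etype not in VALID_TYPES:
            problems.append((number, "unknown TYPE"))
        if not variant.strip():
            problems.append((number, "empty VARIANT"))
    return problems

# The third row uses spaces instead of tabs, a common editing mistake.
print(validate_rows("2\t11\tp\tAaron Albert\n3\t11\tx\tA. Albert\n4 11 p A. M. Albert\n"))
```

Reading the file with encoding="utf-8" before calling such a function would also surface encoding problems early.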
Importing a Name Variant File
The import file is a TSV (tab-separated values) file. It should contain four columns:
KEY. The import process ignores the values in this column because it assigns its own primary key to
each new entity. However, this column must appear as the first column of the TSV file, even though its values are discarded.
PID (Profile Identification). The identification value of an entity. This numerical value is very important: all variants (different
names for the same entity) that belong to the same entity must have the same value. The first occurrence found in the TSV file is
considered the canonical entity (the original name form of the entity) and the following ones are variants of this canonical form (see
the example below).
TYPE. The type of the entity. OSINT accepts four main entity types:
o, for organizations
p, for persons
t, for toponyms (locations)
u, for unknown types of entities
VARIANT. The name variant of the entity. The matching of the name variant is not exact but follows a set of rules.
The import file contains the name variants in the fourth column. These variants are matched against the real text using the following rules:
Rule: Lower case matches both cases
Description: If the name variant is imported as lower case, it matches both upper and lower case in the text.
Example: Name variant "procter and gamble" matches "procter and gamble" and "Procter and Gamble".

Rule: Upper case matches only upper case
Description: If the name variant contains upper case characters, these characters will only match upper case characters in the text.
Example: Name variant "Procter and Gamble" matches "Procter and Gamble" but not "procter and gamble".

Rule: Some characters are ignored
Description: There are a number of special characters which will be ignored. These characters are "." (dot), "&" (ampersand), ":" (colon) and "-" (dash).
Example: Name variant "Procter & Gamble" matches "Procter & Gamble", "Procter - Gamble", "Procter:Gamble", "Procter Gamble", etc.

Rule: Using wildcard character '%'
Description: The percentage character will match zero or more characters.
Example: Name variant "Procter%" matches "Procter&Gamble", "Procter-Gamble", "Procter:Gamble".
Here is an example of a TSV file used for importing new name variants into OSINT:
2 11 p Aaron Albert
3 11 p A. Albert
4 11 p A. M. Albert
6 21 o CCC
7 21 o C.C. Christian
8 41 t Milano
9 61 u Harold Hugh
10 61 u Henry Hugh
In this example, the entity Aaron Albert (person) has a PID value of 11. The first occurrence is the canonical
(original) form for that entity, whereas the following ones with the same PID (A. Albert, A. M. Albert) are considered variants of that
canonical form. However, all these occurrences (variants) represent the same entity in real life (the person Aaron Albert). Another example in
the table is the organization called Chad Calvin Christian (PID 21). As can be seen, there exist one canonical form and two variants (CCC,
C.C. Christian) for this entity. Finally, we find the entity Milano (PID 41) with only one variant (the canonical form) and one entity of unknown
type (Harold Hugh) with two variants.
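The grouping rule described above (rows that share a PID belong to one entity, and the first row is the canonical form) can be sketched in a few lines of JavaScript. This is only an illustration of the file format, not the import code of the suite; the function name groupByPid and the field names of the returned objects are hypothetical.

```javascript
// Sketch only: groups the rows of a four-column name variant TSV file by PID.
// The function name and the result field names are illustrative assumptions.
function groupByPid(tsvText) {
  var byPid = Object.create(null); // PID -> entity record
  var order = [];                  // PIDs in order of first appearance

  tsvText.split(/\r?\n/).forEach(function (line) {
    if (line.trim() === "") return; // skip empty lines
    var cols = line.split("\t");
    if (cols.length !== 4) {
      throw new Error("Expected four tab-separated columns: " + line);
    }
    // cols[0] is the KEY column; it is required but its value is discarded.
    var pid = cols[1], type = cols[2], name = cols[3];
    if (byPid[pid] === undefined) {
      // First occurrence of a PID is the canonical form.
      byPid[pid] = { pid: pid, type: type, canonical: name, variants: [] };
      order.push(pid);
    } else {
      // Later occurrences with the same PID are variants.
      byPid[pid].variants.push(name);
    }
  });

  return order.map(function (pid) { return byPid[pid]; });
}

var sample = [
  "2\t11\tp\tAaron Albert",
  "3\t11\tp\tA. Albert",
  "4\t11\tp\tA. M. Albert",
  "8\t41\tt\tMilano"
].join("\n");

console.log(JSON.stringify(groupByPid(sample), null, 2));
```

Running a check like this over an edited file before importing it is one way to spot rows that lost a column or a tab during editing.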
Finally the system imports the new database and prints a status message in the Console view.
Before starting the import process, the system automatically generates a backup of the current database in the workspace folder
under <workspace>/.metadata/.plugins/it.jrc.osint.extract/entity_20_backup_<date>.h2.db.
If something goes wrong, close the application and copy this backup file back into place over entity_20.h2.db.
Upcoming Feature
The new format of the import file will be introduced with version 2.4 of the software.
In order to facilitate the encoding of name variants, a more user-friendly format of the import file will be introduced shortly.
Basic Requirements
The file consists of multi-line blocks, where each block provides the canonical and variant forms of one single entity.
Each block starts with a line containing the canonical form and the corresponding entity type information, separated by a tab. All
subsequent lines in the same block start with a tab, which is followed by a variant form of the current entity.
Here is an example of an import file that contains 3 blocks corresponding to three entities.
World Anti-Doping Agency	ORG-PP
	Agenzia Mondiale Antidoping
	Agence Mondiale Antidopage
	Weltantidopingagentur
	Agência Mundial Anti-Doping
Amsterdam Airport Schiphol	LOC-FA
	Aéroport de Schiphol
	Aeroporto di Amsterdam
	Schipol International Airport
	Aeroportul Internațional Schipol
George W. Bush	PER
	George Walker Bush
	President Bush
	President George W. Bush
	George W Bush
	Bush the Younger
	George Bush Jr.
The first block contains 5 name variants (including the canonical form) of the entity World Anti-Doping Agency (of type ORG-PP -
political/public organisation). The second block
contains 6 name variants of the entity Amsterdam Airport Schiphol (of type LOC-FA - facility). The third block contains 7 variants of the entity
George W. Bush (of type PER - person).
More information on new entity types will be provided shortly.
Please note that in the case of ambiguous entities (i.e., entities that can have more than one type), a separate block of variants is introduced for
each specific entity type.
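As an illustration of the block format described above, the following JavaScript sketch reads such a file into a list of entities. It assumes only what the text states (a canonical line of the form "canonical form, tab, type" and variant lines that start with a tab); the function name parseVariantBlocks and the result field names are hypothetical, and the real importer of version 2.4 may behave differently.

```javascript
// Sketch only: parses the block-based import format described in the text.
// Function and field names are illustrative assumptions.
function parseVariantBlocks(text) {
  var blocks = [];

  text.split(/\r?\n/).forEach(function (line) {
    if (line.trim() === "") return; // skip empty lines

    if (line.charAt(0) === "\t") {
      // A line starting with a tab is a variant of the current entity.
      if (blocks.length === 0) {
        throw new Error("Variant line before any canonical line: " + line);
      }
      blocks[blocks.length - 1].variants.push(line.trim());
    } else {
      // Any other line starts a new block: canonical form, tab, entity type.
      var parts = line.split("\t");
      if (parts.length !== 2) {
        throw new Error("Expected '<canonical form><TAB><type>': " + line);
      }
      blocks.push({
        canonical: parts[0].trim(),
        type: parts[1].trim(),
        variants: []
      });
    }
  });

  return blocks;
}

var example =
  "George W. Bush\tPER\n" +
  "\tGeorge Walker Bush\n" +
  "\tPresident Bush\n";

console.log(JSON.stringify(parseVariantBlocks(example)));
```

In this sketch the canonical form is kept separately from the variants, so a block with a canonical line and two variant lines yields two entries in the variants list.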
As an example we search for a prominent person and combine the search results into a single folder.
Double-click on Bing in the Search Tool View. The application opens a browser window pointing to http://www.bing.com
Search for Barack Obama:
Select from the main menu Web > Extract Search Result Links to extract the result links of the search and to store them as
bookmarks in the project:
The system extracts the first 100 result links (you can change the maximum number of extracted links under Window > Preferences > OSINT
> Link Extraction). The links are stored in a newly created folder ("Bing_<Time Stamp>") under the bookmarks folder of the case project.
Each result link is stored in a bookmark file, which is an XML file containing the following data: URL, title, creation time
stamp, search query and search engine. If you open the Properties View, which is located next to the Console View (under the
browser window), and select a bookmark file, you can see the data contained in the file.
In this example the system has already detected a duplicate bookmark which is marked with a (d) in front of its title.
By default the system shows the title of the bookmark file and not the file name. You can change this behaviour if you click
on the small triangle in the upper right corner of the workspace navigator view and select "Customize View". Then change to
the Content tab and switch off "Bookmark Title", confirm with "OK".
Now we search again but we use Google as search engine. As the end result we have two sub folders in our bookmarks folder:
The first folder contains the Bing search results, the second one contains the Google search results. Before we download the pages behind
these search results, we want to eliminate duplicate bookmarks.
Duplicate Bookmark
A bookmark file is considered a duplicate of another bookmark, if it contains the same URL and has a newer creation time stamp.
Organizing Bookmarks
In order to organize the bookmark files and delete duplicate bookmarks, do the following:
The deletion is only performed on the selected folder and its sub folders. It does not affect bookmarks stored above the selected folder.
The system now deletes all duplicate bookmarks from the Bookmarks folder and shows the results in the Console view. The markup of
duplicate bookmarks disappears from the bookmarks folder and its sub folders:
The duplicate bookmark detection is also helpful if you want to repeat a search over time (for example each week) and see which search
results are new. If you simply delete the duplicates only the new results remain.
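The duplicate rule used in this tutorial (same URL, newer creation time stamp) can be expressed as a small JavaScript sketch. The bookmark objects and their field names (url, title, created) are assumptions made for the example and do not reflect the XML layout of real bookmark files; bookmarks with identical time stamps are treated here as "first one wins", which the text does not specify.

```javascript
// Sketch only: splits a list of bookmarks into kept entries and duplicates.
// A bookmark is a duplicate if another bookmark with the same URL is older.
// The field names url, title and created are illustrative assumptions.
function splitDuplicates(bookmarks) {
  var oldestByUrl = Object.create(null);

  // First pass: remember the oldest bookmark for every URL.
  bookmarks.forEach(function (b) {
    var seen = oldestByUrl[b.url];
    if (seen === undefined || b.created < seen.created) {
      oldestByUrl[b.url] = b;
    }
  });

  // Second pass: everything that is not the oldest entry is a duplicate.
  var kept = [], duplicates = [];
  bookmarks.forEach(function (b) {
    (oldestByUrl[b.url] === b ? kept : duplicates).push(b);
  });

  return { kept: kept, duplicates: duplicates };
}

var result = splitDuplicates([
  { url: "http://example.org/a", title: "A (first search)", created: 100 },
  { url: "http://example.org/a", title: "A (second search)", created: 200 },
  { url: "http://example.org/b", title: "B", created: 150 }
]);

console.log(result.duplicates.map(function (b) { return b.title; }));
```

When a search is repeated weekly, every result that was already found in an earlier run ends up in the duplicates list, which matches the use case described above.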
Tasks
Crawling a Web Site
Prerequisite: Creating a Case Project
The application contains a crawler component which can be used to crawl (often also called "spider") a targeted web site.
The crawler component starts at a set URL and then follows the links on this web site until a predefined depth has been reached.
In the Workspace Navigator view expand the Crawler folder, right-click it, then click New > Crawler Configuration. An OSINT Crawler
Configuration creation dialog opens.
In the OSINT Crawler Configuration dialog enter a file name and click Finish. The configuration file is created and opened in a
Crawler Configuration editor.
Field (Type): Description
Targeted Websites (list of URLs): The list allows you to add start URLs of web sites which should be crawled.
Max Depth (number): The maximum number of links to follow, default is 1 (crawling all pages connected to the start URL).
Minimum Text Size (number): The minimum extracted text size, default is 200 characters. If a page has less text, it will be ignored.
Random Delay (ms) (number): The waiting time between requests done by the worker threads. Default is 2500 milliseconds.
1. In the Targeted Websites section click Add. The Add a new URL dialog opens.
2. In the Add a new URL dialog enter a full URL, such as http://www.europa.eu, and click OK. The added URL appears in the Targeted
Websites list box.
To save the crawler configuration, click File > Save in the main menu.
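To make the effect of the Max Depth and Minimum Text Size settings more concrete, here is a simplified JavaScript sketch of a depth-limited crawl. It is not the crawler of the suite: the fetchPage callback is a hypothetical stand-in for the download step, the random delay between requests is left out, and the sketch assumes that links of a page with too little text are still followed, which the description above does not state.

```javascript
// Sketch only: breadth-first crawl limited by depth and minimum text size.
// fetchPage(url) is a hypothetical callback returning { text, links } or null.
function crawl(startUrls, fetchPage, options) {
  var opts = options || {};
  var maxDepth = opts.maxDepth !== undefined ? opts.maxDepth : 1;          // default from the table above
  var minTextSize = opts.minTextSize !== undefined ? opts.minTextSize : 200; // default from the table above

  var visited = Object.create(null);
  var queue = startUrls.map(function (u) { return { url: u, depth: 0 }; });
  var accepted = [];

  while (queue.length > 0) {
    var item = queue.shift();
    if (visited[item.url]) continue; // never fetch a URL twice
    visited[item.url] = true;

    var page = fetchPage(item.url);
    if (!page) continue;

    // Pages with less text than the minimum are ignored.
    if (page.text.length >= minTextSize) accepted.push(item.url);

    // Follow links only while the depth limit has not been reached.
    if (item.depth < maxDepth) {
      page.links.forEach(function (link) {
        queue.push({ url: link, depth: item.depth + 1 });
      });
    }
  }
  return accepted;
}

// Tiny in-memory "web site" used instead of real HTTP requests.
var site = {
  "start": { text: "start page with enough text", links: ["news", "empty"] },
  "news":  { text: "news page with enough text", links: ["archive"] },
  "empty": { text: "x", links: [] },
  "archive": { text: "archive page with enough text", links: [] }
};
var pages = crawl(["start"], function (u) { return site[u] || null; },
                  { maxDepth: 1, minTextSize: 10 });
console.log(pages);
```

With a depth of 1 the sketch visits the start page and the pages it links to; raising the depth to 2 would also reach the "archive" page.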
Performing a Crawl
To perform the crawl using the Crawler Configuration file created do the following:
Crawling Speed
Using the default settings, the crawling is rather slow. This avoids being "black listed" on the target site.
The crawler module starts in the background and creates a Bookmark file in the Bookmarks folder for every crawled page.
Downloading Speed
Like the crawling, the download of the pages referred to by the Bookmark files is deliberately slow.
In the main menu click File > New Wizards > Case Project. A project creation dialog is opened.
A Category Definition File is a file type used to define a category or alert within the OSINT Suite. This type of file should be created under
the Category Matching folder that can be found within a Configuration Project. Once a Configuration Project is created, two different
sub-folders appear within the Category Matching folder:
Active Categories sub-folder: contains the active categories that can be currently selected in the category selector shown in the
Category Browser view.
Available Categories sub-folder: contains other category definition files that can be used later as active categories. The categories
defined under this sub-folder are not available in the category selector of the Category Browser view.
To create a Category Definition File do the following:
Right click on one of the sub-folders mentioned above (Active Categories or Available Categories) and click on New > Category
Definition File. Choose a file name and click on Finish.
The Category Definition File editor is automatically opened showing the empty content of the new file. The file extension used for
Category Definition files is adf (alert definition file), as shown in the workspace navigator:
The OSINT Suite also allows using category definition files generated by the official EMM Alert Editor (http://emm.jrc.it/AlertEditor/).
Files generated by the EMM Alert Editor are basically XML files that conform to the XML Schema Definition (XSD) provided by the
EMM team to define categories or alerts. The alert XSD is available at http://emm.jrc.org/alert.xsd
To know how to define categories or alerts within a specific category definition file see Defining a Category.
In the main menu click File > New Wizards > Config Project. A project creation dialog is opened.
Enter the name of the project in the Project name field, preferably name it "Configuration"
Click Finish. The configuration project is created in the workspace.
The configuration project appears as a top level folder in the workspace. Its icon is marked with a small red "c" to show that it is a
configuration project.
Tip
A configuration project contains templates and configuration information which is needed by different modules of the system.
These modules use the settings and templates to process data in Case Projects. Even though it is possible to create more than
one configuration project (only one is active at any given time, as defined in the Preferences), it is recommended to create only one
configuration project to keep things simple.
Creating a Custom Report
Reports are a way to export analysis data from a Case Project into flat files. These files can be HTML documents or machine readable
formats such as tsv files (tab separated value - to be imported into MS Excel).
If you open the application, you see in the lower left corner a Reports view which may be hidden behind the Search Tool view:
This view shows the currently available reports in the system. For example, if you double-click on the AllEntities.ort report, the system creates an
HTML report in the current project which shows a list of all found entities.
However, some reports need to be customised for your project in order to perform as desired or you need to create a custom report from
scratch. This tutorial gives an introduction on how to create a custom report.
Procedure:
See also: Quick Start Guide for descriptions of the individual tasks.
In the main menu click on File > New Wizards > Configuration Project
Fill in "Configuration" as project name and click on "Finish"
The system now creates a new project which is marked with a small red "c" to show that this is not a normal Case Project but rather contains
configuration data and templates.
The following screenshot shows this configuration project with the "Reporting" folder expanded:
A configuration project contains templates and configurations which apply to all case projects in the system. It is possible to
create more than one configuration project; in this case only one is active. (You can choose the active one from the
Preferences.) However, in practice we recommend creating only one configuration project.
The system provides a report template editor to edit the definitions of the template. We open the "RelatedDocuments_Entity.ort" file by
performing a double click on it. It opens in an editor window:
The fields define the required template file and an optional data preparation script used to compile a report:
Field: Description
Output File Extension: Defines which file extension the generated report will have. This can be, for example, html if the report creates an HTML file, or csv for a file readable by MS Excel.
Template Path: The template which is used to generate the report. During report generation the template is filled with the analysis data from the selected case project.
Script Path: An optional JavaScript file which can be used to pre-process the data before filling it into the template. For example, a list of entities could be sorted or filtered before it is used in the template.
Note: The scripting will change in a future version of the software, use it as little as possible.
The default report template creates a report showing all documents related to "Franz Marc" (which is the example name we use in the quick
start guide). We want to change this and generate a report which shows all documents related to "Barack Obama".
Now, as second step we do the same for the javascript file belonging to the report template:
Right click on "SelectEntity.js" in the Workspace Navigator View and select "Copy"
Right click on the "Reporting" folder and select "Paste"
The system opens a dialog, we change the name of the copied file to "SelectBarackObama.js"
The javascript is used to select the needed data for the report. It defines that our main entity is "Barack Obama". (In the next major version of
the software this will be done by using a data selection dialog.)
As a third step, we edit the new report template file (double click on "RelatedDocuments_BarackObama.ort") and insert the following data:
Field Input
We save the report template file (Main Menu > File > Save).
As a final step, we modify the javascript file "SelectBarackObama.js" to select Barack Obama as main entity for our report:
Now, the javascript editor opens and we change the script to the following:
Select Entity
/**
* JavaScript to select an entity and store it into context of
* report template.
*/
var selectedEntity = project.getEntityByName("Barack Obama");
templateContext.put("selectedEntity", selectedEntity);
This selects the entity named "Barack Obama" from the selected project and stores it into the template context. During generation the system
replaces the $selectedEntity variable in the html template "RelatedDocuments_Entity.html" with the entity for Barack Obama.
After saving the JavaScript file, we can test it by right-clicking on "RelatedDocuments_BarackObama.ort" and selecting "Generate
Report".
Note: This complicated procedure of manually editing the JavaScript file to select data will go away in the next major version of
the software. Instead, a selection dialog (similar to the one used to select the source project) will be used to select the entity (or
other data) for the report.
List all documents related to the entity named "Barack Obama" and similar named entities
The entity extraction engine has a normalisation step which tries to match similar names and combine these name variants into a
single entity. This way we avoid having too many entities which represent the same person. However, there is a limit to how well the system can
decide which names are similar enough. If we look at the entity browser view of our application we see that there are quite a few entities
listed which represent "Barack Obama":
We now want to improve our report to include also documents which contain one of the different entities above. (Please note, that by editing
the name variant database of the software, we can avoid these different entities about the same person. How to do this will be covered in a
different tutorial).
Change the data selection script "SelectBarackObama.js" to obtain a list of entities with names starting with "Bara"
Change the output template "RelatedDocuments_Entity.html" so that it lists the documents of a list of entities rather than of a single one
The JavaScript file "SelectBarackObama.js" defines which data is available to the report generator for inclusion in the final report. As a first step
we modify it to include all entities with names starting with the letters "Bara". (We chose these four letters by looking at the entity browser
view.)
We open the "SelectBarackObama.js" javascript from the configuration project by performing a double click on it:
We see that the selectedEntity variable is defined in a first step by calling the function property getEntityByName of the project object.
This object is a predefined object which is the entry point into the analysis data of the source project for the report. The source project in turn
is selected during report generation from a selection dialog.
Now we want to define a list of entities with names starting with "Bara". We look into the online documentation to find a suitable function of
the project object which provides this list:
From the main menu open Help > Help Contents > OSINT Suite User Guide > Reference > Reporting > Data Objects
Review the table showing the object properties of the project object
The function property project.getEntitiesByNamePattern(namePattern) seems to be suitable for our purposes. It needs a regular
expression pattern as parameter to match against the available entities in the system.
Note: in order to match "Bara" as the start of an entity name we provided the pattern "Bara.*" which is a regular expression pattern.
Soon, we will provide a tutorial to show you how to write these patterns to match text.
After selection of the entities, we store them in the templateContext which defines a set of data available to the report output template.
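A data selection script along these lines might look as follows. It is a sketch that uses only the two calls named in this tutorial, project.getEntitiesByNamePattern and templateContext.put; the wrapper function selectEntities and the guard around its call are additions made here so that the snippet can also be run outside the application, and the context key "selectedEntities" is assumed to be the name the output template reads.

```javascript
/**
 * Sketch of a data selection script that selects a list of entities
 * and stores it into the context of the report template.
 * The wrapper function is an illustrative addition, not part of the suite.
 */
function selectEntities(project, templateContext) {
  // "Bara.*" is a regular expression: every entity name starting with "Bara".
  var selectedEntities = project.getEntitiesByNamePattern("Bara.*");

  // Make the list available to the output template under this name.
  templateContext.put("selectedEntities", selectedEntities);
  return selectedEntities;
}

// Inside the report generator, project and templateContext are predefined objects.
// The guard keeps the script harmless when it is run elsewhere.
if (typeof project !== "undefined" && typeof templateContext !== "undefined") {
  selectEntities(project, templateContext);
}
```

Compared with the single-entity script, the only changes are the function that is called on the project object and the name under which the result is stored.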
After changing the data selection script, we need to adapt the output template "RelatedDocuments_Entity.html" to show all documents
related to a list of entities and not only to a single entity.
We do the following:
Now, we open the new "RelatedDocuments_Entities.html" file in a text editor (right mouse click > Open With > Text Editor) and edit it.
Since we use internally Apache Velocity as templating engine (see the User Guide), the output template consists mainly of html code with
variables starting with $ and directives starting with #:
HTML Template
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Related Documents of Entity $selectedEntity.Name</title>
</head>
<body>
<h2>List of related documents of $selectedEntity.Name:</h2>
<ul>
#foreach( $doc in $selectedEntity.RelatedDocuments )
<li><a href="file://$doc.FilePath">$doc.Title</a></li>
#end
</ul>
</body>
</html>
The current output template has a foreach loop directive which loops over the list of all documents related to the selectedEntity (this is the
variable we have defined in the JavaScript).
Now, in order to show all documents of a list of selectedEntities (see JavaScript above), we need to add another foreach loop directive to first
loop over all entities and then, inside it, to loop over all related documents. The resulting code of the output template looks like this (note the
new #foreach directive):
HTML Template
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Related Documents of a list of entities</title>
</head>
<body>
#foreach ($selectedEntity in $selectedEntities)
<h2>List of related documents of $selectedEntity.Name:</h2>
<ul>
#foreach( $doc in $selectedEntity.RelatedDocuments )
<li><a href="file://$doc.FilePath">$doc.Title</a></li>
#end
</ul>
<br/>
#end
</body>
</html>
After saving the new output template, we can test our new report by selecting it from the config project with the right mouse and choosing
"Generate Report".
Please find all files of this tutorial attached to this page. You can simply unzip them to disk, then select from the main menu File > Import >
Documents > File System and import them to your configuration project.
Generating a Report
To generate a report, perform the following steps:
Once we have selected the desired Case Project from the list of available projects, the system creates the report and shows the
progress in the Progress View. After the report has been generated it will be available from the Workspace Navigator view.
In the image above, the results for the predefined report "AllEntities" are shown in the Document Viewer view.
Defining a Category
Category names
Patterns section
Words threshold and keywords weight
Combinations section
Proximity
Wildcards
Uppercase/Lowercase definitions
Useful tricks
Examples
EMM OSINT Suite provides a Domain-Specific Language (DSL) to define a category or alert by using the Category Definition File editor
view. A Category Definition File (adf file) mainly consists of two sections:
Patterns section, which is a simple keyword-weight list with a defined threshold (also optional).
Combinations section, which is a list of keyword combinations and optionally a proximity attribute.
The keywords in the patterns section determine whether a text is classified to fit a category or not.
Category names
Regarding the names to be used as category names, they should be unique in the system. This means that a name can only be used for
one category exactly. It should contain only non-accented alphanumerical characters. The name is case sensitive but case should
nevertheless NOT be used to distinguish between categories, i.e. category 'myTest' and 'mytest' should not be used at the same time.
Patterns section
The first way to define an alert is to use a list of keywords with associated weights. This simple keyword method is preferred for
performance reasons and it is very effective if you are looking for precise, unambiguous terms or names (e.g. Gazprom Media, brucellosis,
Michael Phelps). If a precise term consists of two or more words, you can use the wildcard “+”. For instance, if you are interested in “yellow
fever” you do not want all the documents containing the word “yellow” and “fever” individually anywhere in the document, but you want them
to appear together and in the same order, so you can use the “+” symbol (“yellow+fever”). “+” effectively skips the white space between the 2
words. Note that “+” also skips punctuation marks, so “yellow, fever” would be valid.
An important feature of the Category Matcher is that it is multilingual. If the users want to categorize articles in various languages, they have
to define their alerts also in different languages. The system will accept terms in any language, and an alert may have any number of
keywords.
The value of the words threshold for each alert and the weight for each pattern can be chosen by the user and both are optional. If the user
defines them, they have to be integer values. The system keeps track of the total weight of the individual patterns, and
only if the threshold set by the user has been reached will the document be assigned to the category. The words threshold is ONLY used
within the current definition. The value has no particular meaning other than to be checked against the total of the values of the patterns found in
the text. The word weight list can also be used with weights less than the threshold value.
The system already takes into account the same pattern for a maximum of 8 occurrences. It assigns a decreasing weight based on the
following multipliers: 1.0, 1.0, .6, .4, .4, .2, .2, .2. So, if a pattern matches multiple times, the system will automatically reduce the value of the
weight and one pattern will never score more than 4 times its full value.
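The arithmetic behind the decreasing multipliers can be checked with a short JavaScript sketch. The multipliers are the ones listed above (their sum is 4.0, hence the limit of 4 times the full weight); the function names and the shape of the input objects are assumptions made for the example, and the sketch is not the scoring code of the Category Matcher.

```javascript
// Multipliers applied to the 1st, 2nd, ... 8th occurrence of the same pattern.
var MULTIPLIERS = [1.0, 1.0, 0.6, 0.4, 0.4, 0.2, 0.2, 0.2];

// Score contributed by one pattern with the given weight that occurs `count` times.
// Occurrences beyond the eighth do not add anything.
function patternScore(weight, count) {
  var score = 0;
  for (var i = 0; i < count && i < MULTIPLIERS.length; i++) {
    score += weight * MULTIPLIERS[i];
  }
  return score;
}

// patterns: array of { weight, count } (illustrative field names).
// Returns true if the summed score reaches the words threshold.
function reachesThreshold(patterns, threshold) {
  var total = patterns.reduce(function (sum, p) {
    return sum + patternScore(p.weight, p.count);
  }, 0);
  return total >= threshold;
}

// A pattern with weight 20 and a threshold of 50:
console.log(patternScore(20, 2), reachesThreshold([{ weight: 20, count: 2 }], 50));
console.log(patternScore(20, 3), reachesThreshold([{ weight: 20, count: 3 }], 50));
```

With a threshold of 50, a pattern of weight 20 has to occur three times (20 + 20 + 12) before it reaches the threshold on its own, whereas a pattern of weight 50 reaches it with a single occurrence.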
Combinations section
The second way to define an alert is to create one or more combinations of lists of keywords. A combination section is composed of one
or more "OR" lists of patterns (OR sub-sections) and optionally one "NOT" list of patterns (NOT sub-section). When a combination section is
defined by the user, at least one of the keywords belonging to each OR sub-section must be found in the document for the category to be assigned
to it. If any of the keywords defined within the NOT sub-section is found, the document is automatically discarded
even if some OR combinations were found. Therefore, the "NOT" list of patterns means "unwanted words".
One should use a combination for a broader term or concept (e.g. imported disease, release of toxic substances, equal rights or such
combinations as Russian peacekeeping mission in Caucasus, Russian Georgian Conflict). As explained above, each "OR" list of patterns
would express a certain concept and a document would be considered for the alert if every concept is found in the document but rejected if
the “NOT” concept is found.
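The logic of a combination can be summarised in a few lines of JavaScript. The sketch below covers only the OR/NOT rule described above; it ignores the proximity attribute and wildcards, compares keywords literally, and uses hypothetical names (combinationMatches, the or and not fields) and made-up keyword lists.

```javascript
// Sketch only: evaluates one combination against the keywords found in a document.
// foundTerms: array of keywords that occur in the document.
// combination: { or: [[...], [...]], not: [...] } (illustrative structure).
function combinationMatches(foundTerms, combination) {
  function containsAny(list) {
    return list.some(function (keyword) {
      return foundTerms.indexOf(keyword) !== -1;
    });
  }

  // Any keyword from the NOT list rejects the document.
  if (combination.not && containsAny(combination.not)) return false;

  // Every OR list must contribute at least one keyword.
  return combination.or.every(function (list) { return containsAny(list); });
}

var conflict = {
  or: [
    ["russian", "russia"],
    ["georgian", "georgia"],
    ["conflict", "war"]
  ],
  not: ["football"]
};

console.log(combinationMatches(["russia", "georgian", "war"], conflict));
console.log(combinationMatches(["russia", "georgian", "war", "football"], conflict));
```

Each OR list stands for one concept, so the first document above matches because all three concepts are present, while the second one is rejected by the unwanted word.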
Proximity
The “proximity” value is optional and it can be very handy. It will define a word context size within which the combination terms have to occur.
If this value is not defined by the user, then the Category Matcher will use 10 as default value.
Wildcards
The Category Matcher allows using several wildcard characters:
% (percent) for 0, 1 or more characters. E.g. origin% would match original, originality, originally, originate, originating, originator,
origination... This wildcard can be very useful with inflecting/fusional languages like Russian for example.
_ (underscore) for exactly one character, it does not denote a blank. E.g. p_t would match pot, put, pat…., “organi_ation” would
match both “organization” and “organisation”.
Set: [abc] in a pattern definition means that the system will match either an ‘a’, a ‘b’ or a ‘c’ in that position. E.g. c[aou]t would match
‘cat’, ‘cot’ and ‘cut’.
It is possible to introduce prefixes in the following way: @prefix]. Please note that you have to introduce a prefix only once in the
whole system. E.g. together with words like bug, bunk, but, claim you can introduce @de] and the system will automatically get
debug, debunk, debut, declaim etc. This symbol should be used with caution as it will affect all other alert definitions.
The “+” (white space) sign can be used to build or unite term strings. E.g. Olympic+games, News+Brief, dmitry+medvedev would
match dmitry (white space) medvedev.
Using these wildcards can be very helpful to build common patterns for multiple languages because they can substitute accented characters.
E.g. ent_rotox% would match enterotoxine (de), entérotoxin (fr), enterotoxín (sk) etc.
It is possible to use word-initial wildcards (both _ and %), but these should be used as little as possible because they are computationally
heavy. If you only want to cover one word-initial letter, you should use the _ (underscore) instead of % (percent). If there are only two or three
variants of your chosen word/pattern, it would be much better to put them in explicitly instead of using a wildcard.
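One way to get a feeling for these wildcards is to translate a pattern into a regular expression and try it on a few words. The JavaScript sketch below does this for %, _, sets and +. It is an approximation for experimenting, not the matching engine of the Category Matcher: it assumes that % and _ never cross white space, it does not handle the @prefix] notation or the uppercase/lowercase rule, and the function name patternToRegExp is hypothetical.

```javascript
// Sketch only: translates a category pattern into a JavaScript regular expression.
function patternToRegExp(pattern) {
  var source = "";
  var i = 0;
  while (i < pattern.length) {
    var c = pattern.charAt(i);
    if (c === "%") {
      source += "\\S*";             // zero or more characters (assumed: no white space)
    } else if (c === "_") {
      source += "\\S";              // exactly one character, never a blank
    } else if (c === "+") {
      source += "[\\s\\p{P}]+";     // white space and punctuation between two words
    } else if (c === "[") {
      var end = pattern.indexOf("]", i);
      if (end === -1) throw new Error("Unclosed set in pattern: " + pattern);
      // Copy the set, escaping characters that are special inside a character class.
      source += "[" + pattern.slice(i + 1, end).replace(/[\\\]\[^-]/g, "\\$&") + "]";
      i = end;
    } else {
      // Any other character is matched literally.
      source += c.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
    }
    i++;
  }
  // The "u" flag enables \p{P} (Unicode punctuation) and full code point matching.
  return new RegExp("^" + source + "$", "u");
}

console.log(patternToRegExp("organi_ation").test("organisation"));
console.log(patternToRegExp("yellow+fever").test("yellow, fever"));
console.log(patternToRegExp("c[aou]t").test("cit"));
```

Testing a pattern against a handful of wanted and unwanted words in this way can show quickly whether a wildcard is broader than intended.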
Uppercase/Lowercase definitions
A pattern definition should normally be in lowercase, but can contain upper case characters. In that case the pattern will only match text that
has an uppercase character in the same position. Forcing uppercase can be used for acronyms that would otherwise cause problems. This
means that a lowercase character in a pattern matches both lower and uppercase in the incoming text, but an UPPERCASE character only
matches uppercase, so the pattern “abc” would match ABC, ABc, Abc, aBC, AbC etc., but the pattern “Abc” would only match Abc, ABC,
ABc etc., all with the uppercase “A”.
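The case rule can be illustrated with a small JavaScript sketch that turns a literal pattern (no wildcards) into a regular expression: every lowercase letter is allowed in both cases, every other character has to match exactly. The function name casePatternToRegExp is hypothetical and the sketch is only meant to reproduce the examples given above.

```javascript
// Sketch only: applies the uppercase/lowercase rule to a literal pattern.
function casePatternToRegExp(pattern) {
  var source = "";
  for (var i = 0; i < pattern.length; i++) {
    var c = pattern.charAt(i);
    if (c !== c.toUpperCase()) {
      // Lowercase letter: matches lower and upper case in the text.
      source += "[" + c + c.toUpperCase() + "]";
    } else {
      // Uppercase letter or other character: must match exactly.
      source += c.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
    }
  }
  return new RegExp("^" + source + "$");
}

["ABC", "Abc", "aBC", "abc"].forEach(function (word) {
  console.log(word,
    casePatternToRegExp("abc").test(word),
    casePatternToRegExp("Abc").test(word));
});
```

For an acronym it is therefore usually safer to write the pattern in capitals, so that ordinary lowercase words with the same letters are not matched.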
Useful tricks
Negative weights can be used. Negative weights can be useful if a search word is homographic with some other unrelated word or
with a person name, or if a search word has many meanings. E.g. if you are interested in finding texts mentioning “tsunami”-sea
storm, you could put several words with a negative score of let’s say -999.
rock-band -999 (there is a famous Indian rock band “Tsunami”)
Arashi+Tsunami -999 (a Japanese voice actor)
Satoshi+Tsunami -999 (a Japanese football player)
deodorant -999 (“Tsunami” fragrance by Axe)
politics -999 (“Tsunami” term used to describe an overwhelming victory by a political party)
Another example: if you are interested in Michael Jackson the Canadian actor and not the musician, you could put words
like pop music, songwriter, dancer etc. with a negative weight.
Use weights and a threshold (e.g. 50) so that some words can trigger the alert on their own (weight = 50) while other words
need to occur cumulatively (several times) before reaching the threshold (e.g. weight = 20). For example:
Be careful with abbreviations. E.g. a very simple (at first sight) abbreviation “ABC” can stand for the following: Latin Alphabet,
American Broadcasting Company, Australian Broadcasting Company, Associated British Company, Appalachian Brewing company,
Atlanta Bread company, Agricultural Bank of China, ABC (programming language), abc conjecture, All Lesotho Convention, ABC
(island in Alaska) etc. Please keep in mind that a simple word (or an abbreviation) in English can mean a completely different thing in
Italian, German, Russian, Bulgarian… E.g. a Portuguese word for “vomiting” is “emese”, and “Emese” is a very common first
feminine name in Hungary. Another typical example is the word ‘mais’ which means the agricultural crop in many languages, but in
French means ‘but’. The French version is written with an accented ‘i’. In order to avoid these conflicts it is sometimes useful to
define multiple combinations with the word lists in the combination reflecting the various languages or language groups. A
combination is unlikely to trigger the category if one or-list consists of English words and the other of French words.
Examples
Here you can find some examples of using the DSL to define Category Definition Files:
Pressing Ctrl + Space in the adf editor view shows the commands that can be used at the current position of the cursor.
Defining the maximum number of search result links
See also: Performing an Internet Search
The search result link extraction function extracts links from the result pages of search engines. It tries to extract the result links on the first
and following pages up to a maximum number of links. You can define this maximum link count as follows:
1. In the main menu click Window > Preferences. The Preferences dialog opens.
2. In the topic tree on the left side, expand OSINT Preferences and click Link Extraction. The Link Extraction preference page is
shown on the right side.
3. In the Link Extraction preference page enter a number in the Maximum link count field, then click OK.
Downloading of Bookmarks
Prerequisites: Creating a Case Project and Performing an Internet Search
Bookmark files contain URLs pointing to some resources on the web. These resources are mostly web pages or file resources (such as .pdf
files). To analyse these resources locally, they need to be downloaded to the Case Project in the workspace.
Once all the search result links (bookmarks) have been downloaded under the Bookmarks folder, the next step is to download the
result pages as text files:
Right click on the desired bookmark folder in the Workspace Navigator view and click on Download Bookmarks. The Progress
View shows the progress of the download. The downloaded web pages or files (for example PDF files) are stored in the predefined
Documents folder of the case project.
After downloading the search results, a new folder will appear under the Documents folder with the same name as the selected
Bookmarks folder. It will contain a text file with the raw content for every link stored in the Bookmarks folder. After the download has
finished, the system automatically extracts the raw text from the downloaded web pages or files. The system can extract raw text from a
variety of file formats (such as PDF, MS Office formats and others) and also detects the language of the text.
The download process is performed in parallel, but in some cases it has to wait for responses from URLs that point
to slow servers.
The files in the Documents folder are marked with a > sign to show that the entity extraction has not yet been performed. Some files
may be marked with a red indicator showing that the text extraction has failed. In most cases the file is either of an unsupported file
type or does not contain enough relevant text.
The downloaded result pages or result files are given a name based on the result URL they were downloaded from. If a title can be
extracted from the extracted text of the file this title is shown as an overlay in the Workspace Navigator view. If the text extraction
failed to find a title, the original file name is shown instead.
The software allows you to filter out Bookmarks by matching their URLs against custom patterns.
To define filter patterns, open the Link Extraction preferences:
In the Filter out URL Patterns table, custom patterns can now be defined:
Filter Pattern
The patterns use regular expression syntax to match against search result URLs
Example
You want to filter out all result Bookmarks which point to linkedin.com pages. Use the following pattern:
.*linkedin\.com.*
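To illustrate how such a pattern behaves, the following minimal Python sketch applies the example pattern to a list of URLs. It is not the Suite's implementation: the function name filter_bookmarks and the sample URLs are hypothetical, and it assumes that a pattern has to match the complete URL.

```python
import re

# Pattern taken from the example above; the list itself is illustrative.
FILTER_PATTERNS = [r".*linkedin\.com.*"]


def filter_bookmarks(urls, patterns=FILTER_PATTERNS):
    """Return the URLs that match none of the filter patterns.

    Assumption: a pattern must match the whole URL (full match).
    """
    compiled = [re.compile(p) for p in patterns]
    return [u for u in urls if not any(c.fullmatch(u) for c in compiled)]


# Hypothetical search result links
urls = [
    "https://www.linkedin.com/in/some-profile",
    "https://example.org/report.pdf",
]
kept = filter_bookmarks(urls)
print(kept)
```

Because the pattern starts and ends with .*, it matches any URL that contains linkedin.com anywhere, so only the second URL is kept.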
The EMM-OSINT Suite also supports importing bookmarks from different web browsers such as Internet Explorer, Firefox or Google Chrome
into the Bookmarks folder of the Case Project. The first step is to export the bookmarks from your web browser into an HTML file.
Follow the next links in order to find out how to export bookmarks from your web browser:
Once the bookmarks have been exported from the web browser into a file (usually based on an older HTML format), do the following to
import it into the EMM-OSINT Suite:
Click in the main menu File > Import and from the list of available import sources select Bookmarks > Import Bookmarks from
Chrome/Firefox/IE (HTML file) and click Next. The system shows the Import dialog
Click Browse and select the file exported previously from the web browser. Click Next to proceed and the import wizard shows the
page to define the import location in your Workspace.
Click Browse to select a folder in your Workspace. A file selection dialog opens. Select the Bookmarks folder or a sub-folder and
click OK. The file selection dialog closes
Click Finish to perform the bookmarks import. The system imports the bookmarks from the HTML file and prints a status message
for each imported bookmark in the Console view. After the import has finished successfully, the imported bookmarks appear in your
Workspace
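To check what an exported bookmark file contains before importing it, the links can be listed with a few lines of Python. This is only a sketch and not the import wizard's logic: it assumes that the exported HTML file stores each bookmark as an anchor tag with an HREF attribute, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collects (url, title) pairs from the anchor tags of a bookmark export."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_href = href
                self._title_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            title = "".join(self._title_parts).strip()
            self.links.append((self._current_href, title))
            self._current_href = None


def extract_bookmarks(html_text):
    """Return all (url, title) pairs found in the given HTML text."""
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical excerpt of an exported bookmark file
sample = """<DL><p>
<DT><A HREF="https://example.org/" ADD_DATE="1400000000">Example</A>
<DT><A HREF="https://example.com/report.pdf">Report</A>
</DL><p>"""

for url, title in extract_bookmarks(sample):
    print(title, url)
```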
Importing Documents from Local Disk
Prerequisite: Creating a Case Project
The EMM-OSINT Suite allows importing files from local disk into a Case Project for a further analysis. To import files perform the following
procedure:
Click in the main menu File > Import and from the list of available import sources select Documents > File System and click Next.
The system shows the Import dialog
Click Browse to select the import directory in the From directory field. The system shows an Import from directory file dialog.
From the file system list select the import base directory and click OK. The system closes the dialog and returns to the Import dialog.
Click on a folder or file to select it for import (optional: click Filter Types to filter the import for specific file types).
Click Browse to select the import directory inside your Workspace (note: make sure to select a Case Project).
Enable the Create top-level folder option.
Files must be imported into the Documents folder of a Case Project (or into a subfolder of the Documents folder) in order to be
analysed later using the Entity Extraction.
Prerequisite: Download the application archive from our download page. After download you should have an application archive named
emm-osint-suite-<version-number>.linux.gtk.x86_64.tar.gz (for the 64-bit version) in your download folder.
Extract the application archive to a location of your choice. According to the Filesystem Hierarchy Standard (FHS) the application should be
placed in /opt/emm-osint-suite-<version-number>
If the software does not start by clicking on the osint executable in the installation directory or gives an error message, please add executable
permission to two files as follows (the directory names may change for later versions - commands for version 2.3.3 shown):
Related Topics:
Running on Linux
Requesting Support
Installing on Windows
Prerequisite: Download the application archive from our download page. After download you should have an application archive named
osint_<version-number>_x86_64.zip (for the 64-bit version) in your download folder.
Note: For a successful installation local administrative rights may be needed. Contact your PC support for help if the installation fails.
The application archive is a compressed ZIP archive which can be decompressed using standard tools of MS Windows (or using tools such
as 7zip or Winzip). In order to install the application on an MS Windows PC, perform the following steps:
1. Double-click the application archive to uncompress it in the download folder. The uncompress utility should
create an application folder named osint_<version-number>_x86_64.
2. Open the Windows File Explorer and navigate to the download folder.
3. Move the application folder osint_<version-number>_x86_64 from the download folder to the standard Windows program folder:
c:\Program Files\osint_<version-number>_x86_64
You can also install the application in a non-standard location. Since the version number is included in the name of the application
folder, multiple versions may co-exist on your system.
Related Topics:
Requesting Support
Running on Windows
To find relevant information on the Internet the application allows you to search using the major Internet search engines and then use the
search results for further research.
Double-click on one of the available search engines in the Search Tools view in the lower-left corner. For this example choose
Microsoft's search engine Bing. The start page of the search engine opens in a new Browser editor window.
Perform your search as usual using the online search engine. The result links for the search then appear in
the Browser editor window:
Since you are using the normal web interface of the selected search engine, you can use all available options and refine your
search to be as relevant as possible. As soon as you are satisfied with the shown results, you can proceed to the next step
Click on Web > Extract Search Result Links to extract the search results from the result page of the search engine and store them
as bookmarks in your Case Project. The link extraction process starts and the system shows a warning that it will take control of the
Browser editor in order to extract the links. If more than one case project exists, the system opens the Project Selection dialog to
choose the case project in which the search links (bookmarks) are stored. The software extracts ("scrapes") the search result links
from the first and subsequent pages of the search engine. After it has extracted the maximum number of links (by default 100 links), it
returns to the first result page. The progress of the scraping is shown in the Progress View. You can stop
the link extraction process at any time by clicking on the red cancel icon in the Progress View.
After the link extraction has finished, it creates a folder with Bookmarks files in the Bookmarks folder of the defined target Case
Project. The extracted result links are stored as .obm files. A .obm file which we also call a “bookmark file” contains the actual result
link plus meta information such as the search engine, the search query and a time stamp. Such meta information can be found by
clicking on the bookmark file and then clicking on the Properties view.
If several searches are carried out in the same Case Project, a (d) sign might appear on the left side of a Bookmarks folder. It
means that the folder contains links which duplicate links already stored in the same Case Project. To
discard them, click on Web > Delete Duplicate Bookmarks.
In order to review the result links faster without opening a new browser window each time, you can select from
the main menu Window > Reuse Browser . If this menu item is enabled the current browser editor view is
reused and loads the link of the bookmark file with a simple click on the file in the Workspace Navigator view.
If you double-click on a bookmark file, you can still force a new browser editor to open. If you want to delete irrelevant bookmark files,
you can simply right-click the file and click Delete.
The system automatically detects bookmarks pointing to the same URL (so-called duplicates). These duplicates are marked with a (d) prefix
in the Workspace Navigator view. For each duplicate bookmark there exists another bookmark pointing to the same URL but with an older
time stamp. In order to delete all duplicates under a certain folder: right-click on the folder in the Workspace Navigator view and click
Organize Bookmarks > Delete Duplicates. The duplicate detection is handy if you gather search results from multiple search engines and merge
the results before you download the result pages to your case project.
See also: Using the duplicate bookmark detection and Delete duplicate bookmarks.
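The duplicate rule described above (a bookmark is a duplicate if another bookmark with the same URL and an older time stamp exists) can be sketched in Python as follows. The Bookmark class and the function find_duplicates are hypothetical illustrations and do not reflect the actual .obm file format. Only the listed fields (URL, search engine, search query, time stamp) are taken from the description of bookmark files.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Bookmark:
    """Hypothetical in-memory form of a bookmark file."""
    url: str
    search_engine: str
    query: str
    timestamp: datetime


def find_duplicates(bookmarks):
    """Return the bookmarks for which an older bookmark with the same URL exists."""
    oldest = {}
    # Visit bookmarks from oldest to newest; the first one seen per URL is kept.
    for bm in sorted(bookmarks, key=lambda b: b.timestamp):
        oldest.setdefault(bm.url, bm)
    return [bm for bm in bookmarks if oldest[bm.url] is not bm]


bookmarks = [
    Bookmark("https://example.org/a", "Bing", "osint tools", datetime(2014, 1, 10, 9, 0)),
    Bookmark("https://example.org/a", "Google", "osint tools", datetime(2014, 1, 10, 9, 5)),
    Bookmark("https://example.org/b", "Google", "osint tools", datetime(2014, 1, 10, 9, 6)),
]
duplicates = find_duplicates(bookmarks)
for bm in duplicates:
    print("(d)", bm.url, bm.search_engine)
```

In this example only the later Google bookmark for https://example.org/a is reported as a duplicate, because the Bing bookmark for the same URL carries the older time stamp.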
The entity extraction finds entities in the set of documents located in the Documents folder of a Case Project. The Documents folder is a
special predefined folder which contains all input documents for the entity extraction. This means that downloading bookmarks or importing
documents from local disk into the Documents folder is a prerequisite for running the entity extraction process.
The Documents folder is marked with a special icon. Whenever the Documents folder shows a greater-than sign (>)
in front of its name, the entity extraction needs to run:
In the Workspace Navigator view click on the case project containing the Documents folder which is not up-to-date.
In the main menu click on Project > Build Extraction. The entity extraction starts to run; you can monitor the progress in the
Progress View.
Alternatively, you can use the context menu to run the Entity Extraction. Right-click on the Documents folder in the
Workspace Navigator view, then click Build Extraction. The entity extraction starts to run.
As soon as the entity extraction has finished, the Documents folder is no longer marked with the > sign:
Requesting Support
If you experience any problems with the software or would like to suggest a new feature, please contact us.
Write an email to [email protected] and include as many details as possible in your email.
Attach the log file to your email, which allows us to find the cause of the error more quickly:
1. Open your current workspace directory using the Windows File Explorer and open the sub-directory
.metadata/.plugins/it.jrc.osint.logging
2. The sub-directory contains the log file named osint.log (possibly along with some older archived log files)
3. Zip the osint.log file and attach it to your email
Note: Normally, the log file should not contain any confidential data. However, you may want to review it before sending it to us.
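If you prefer to script the steps above, the log file can also be zipped with a short Python function. This is a minimal sketch: the function name zip_log_file and the name of the target archive are hypothetical, and the sub-directory and log file name are taken from the steps above.

```python
import zipfile
from pathlib import Path

# Location of the log file relative to the workspace directory (from the steps above)
LOG_SUBDIR = Path(".metadata") / ".plugins" / "it.jrc.osint.logging"


def zip_log_file(workspace_dir, target_zip):
    """Write osint.log from the given workspace into a new ZIP archive."""
    log_file = Path(workspace_dir) / LOG_SUBDIR / "osint.log"
    if not log_file.is_file():
        raise FileNotFoundError(f"Log file not found: {log_file}")
    with zipfile.ZipFile(target_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the file without its directory path inside the archive.
        archive.write(log_file, arcname=log_file.name)
    return Path(target_zip)


# Example call (paths are placeholders for your own workspace location):
# zip_log_file("/home/<user-id>/osint-workspace", "osint-log.zip")
```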
Running on Linux
Prerequisite: Installing on Linux (Ubuntu)
Once the EMM OSINT Suite has been installed on your system, running the application requires you to define a workspace folder on a local
disk drive to keep your user data.
Important
The workspace folder needs to be placed on a local disk drive. The application will fail if the workspace is on a network drive.
By default the application places the user data in the default user home directory /home/<user-id>/osint-workspace.
cd /opt/osint_2.3.1_linux.x86_64
If the application does not start, make sure that the following files are executable:
/opt/osint_<version-number>_linux_x86_64/osint
/opt/osint_<version-number>_linux_x86_64/jre/bin/java
Under Ubuntu you can set the execute permissions with the following commands:
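As an illustration of what setting the execute permission amounts to, the following Python sketch adds the execute bits to a file, comparable to chmod a+x. The function name make_executable is hypothetical and not part of the EMM OSINT Suite. On a real system you would call it for the two files listed above.

```python
import os
import stat


def make_executable(path):
    """Add the execute permission for user, group and others to a file."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Example calls (replace <version-number> with your installed version):
# make_executable("/opt/osint_<version-number>_linux_x86_64/osint")
# make_executable("/opt/osint_<version-number>_linux_x86_64/jre/bin/java")
```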
Running on Windows
Prerequisite: Installing on Windows
Once the EMM OSINT Suite has been installed on your system, running the application requires you to define a workspace folder on local
disk to keep your user data.
Important
The workspace folder needs to be placed on a local disk drive. The application will fail if the workspace is on a network drive.
By default the application places the user data in the default Windows User Directory under c:\Users\<User-Name>\osint-workspace.
1. Open the installation location in Windows File Explorer. By default the location is c:\Program Files\osint_<version-number>_x86_64.
2. Double-click on the osint.exe executable to start the application. The application shows a splash screen, then shows the
Workspace Selection dialog.
3. In the Workspace Selection dialog either
Click OK to choose the default workspace location
or Click Browse to define a custom workspace location. The application shows a file dialog to choose the custom
workspace location.
4. In the Workspace Selection dialog you can also enable the Remember Workspace Location check box. If enabled, the application
will not show the Workspace Selection dialog during next startup but simply re-use the selected location.
You can switch to a different workspace location from within the running application by clicking on File > Switch Workspace in the
main menu. This option is needed if you enabled Remember Workspace Location previously in the Workspace Selection dialog
and want to change to a different workspace location later on.
The reason is that the underlying Java VM needs a contiguous memory area. If a lot of other applications are open and drivers are loaded,
the system may not be able to allocate the needed memory area.
Related Topics:
Requesting Support
The system prints the found proxy information in the Console View.
Use the printed information to set the proxy information in the Network Connections Preference Panel.
Fill in the Host and Port fields with the proxy host and port.
If the edited proxy schema requires authentication, enable the Requires Authentication: option.
Fill in User and Password for authentication
Click OK to save the schema entry
In the Network Connections preferences dialog click OK to save the new preference settings.
The local search in the OSINT Suite provides a rich query language through which the user can perform wildcard searches or boolean
queries. A query can be broken up into terms and operators. There are two types of terms: single terms and phrases. A single term is a
single word such as "house" or "car". A phrase is a group of words surrounded by double quotes such as "white house". Multiple terms can
be combined together with boolean operators to form a more complex query.
Wildcard searches
The local search supports single and multiple character wildcard searches within single terms (not within phrase queries):
To perform a single character wildcard search use the "?" symbol. The single character wildcard search looks for terms that match
the pattern with the "?" replaced by any single character. For example, to search for "text" or "test" you can use the search: te?t
To perform a multiple character wildcard search use the "*" symbol. Multiple character wildcard searches look for 0 or more
characters. For example, to search for test, tests or tester, you can use the search: test*. You can also use wildcards in
the middle of a term: te*t. Note that you cannot use a * or ? symbol as the first character of a search.
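The wildcard rules can be tried out with the following Python sketch. It is an approximation based on Python's fnmatch module, not the search engine used by the OSINT Suite. The function name wildcard_match is hypothetical, and fnmatch additionally treats square brackets as special characters, which the local search may not do.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, term):
    """Check whether a single term matches a pattern containing ? and *."""
    if pattern[:1] in ("*", "?"):
        # Mirrors the rule that a search must not start with a wildcard.
        raise ValueError("a search must not start with * or ?")
    return fnmatchcase(term, pattern)


for pattern, term in [("te?t", "text"), ("te?t", "test"), ("test*", "tester"), ("te*t", "tent")]:
    print(pattern, term, wildcard_match(pattern, term))
```

All four combinations in the loop match. A term such as tet does not match te?t, because ? stands for exactly one character, but it does match te*t, because * may also stand for no character at all.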
Boolean searches
Boolean operators allow terms to be combined through logic operators. These are the operators that our local search supports:
AND
+
OR
NOT or !
-
Note: boolean operators must be ALL CAPS. Some examples of using boolean searches:
1. To search for documents that contain either "white house" or just "house" use one of the following queries (note: you can use double
quotes for searching an exact phrase):
2. To search for documents that contain "white house" and "black house" use the query:
3. The "+" or required operator requires that the term after the "+" symbol exist somewhere in the document. Thus, to search for
documents that must contain "house" and may contain "white" use the query:
+house white
4. The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol !
can be used in place of the word NOT. To search for documents that contain "white house" but not "black house" use the query:
Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:
5. The "-" or prohibit operator excludes documents that contain the term after the "-" symbol. To search for documents that contain
"white house" but not "black house" use the query:
Grouping
Our tool also supports using parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean
logic for a query. To search for either "white" or "black" and "house" use the query:
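Since the boolean operators correspond to set operations on the matching documents, their effect can be sketched with Python sets. The three sample documents and the helper docs_with are hypothetical, and the helper uses simple substring matching instead of a real search index. The query strings in the comments are written in the operator syntax described above and are assumptions, not quotes from the product.

```python
# Hypothetical mini collection: document name -> text
documents = {
    "doc1": "the white house",
    "doc2": "a black house",
    "doc3": "a white car",
}


def docs_with(term):
    """Return the names of all documents whose text contains the term or phrase."""
    return {name for name, text in documents.items() if term in text}


# "white house" OR house            -> union
either = docs_with("white house") | docs_with("house")

# "white house" AND "black house"   -> intersection
both = docs_with("white house") & docs_with("black house")

# "white house" NOT "black house"   -> set difference
without = docs_with("white house") - docs_with("black house")

# (white OR black) AND house        -> parentheses group the OR before the AND
grouped = (docs_with("white") | docs_with("black")) & docs_with("house")

print(sorted(either), sorted(both), sorted(without), sorted(grouped))
```

Without the parentheses in the last query, the evaluation order of OR and AND would depend on the query parser, which is why grouping is useful for controlling the logic.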
The Category Browser view is used to find which documents belong to any category previously defined in the OSINT Suite. A category
or alert can be seen as a file in which several keywords or combinations of keywords are defined by the user. Therefore, before using
the Category Browser view, a Configuration Project should be created and at least one Category Definition File should be generated
in order to check how the Category Browser view works.
The Category Browser view is usually opened behind the Entity Browser view:
If it is not shown behind the Entity Browser tab once OSINT starts, then it can be activated by clicking on Window > Show View > Category
Browser. To bring it to front, click its title bar.
After activating the Category Browser view, a message indicating that no active Configuration Project was found may be shown if there
is no active Configuration Project in the workspace.
The Entity Browser view can be used to browse the entity information extracted from the set of documents. By default the Entity Browser
view is opened behind the Workspace Navigator view. To bring it to front, click its title bar.
The context menu can be accessed by clicking the small triangle next to the title tab. The context menu contains the following actions:
Refresh - forces the system to reload the data in the Entity Browser
Filter Entity Types... - allows you to filter out certain entity types
Sort Entity By
Alphabetic - orders entities in the Results Tree in alphabetic order
Frequency - orders entities in the Results Tree by number of occurrences (frequency)
Related Docs - orders entities in the Results Tree by number of related documents (documents which contain the entity)
Project Selector:
The project selector is used to select one of the existing case projects in the workspace.
The search field is used to search for specific entities. It supports * as a wildcard for any characters. For example, the search term "Barac*"
will find any entity starting with "Barac". Enter a search term and click Search. The results will be shown in the result tree.
Back Navigation:
This drop-down element shows a list of previously defined search operations. Select one to go back to previous search results.
This is a tree-like viewer which shows the found entities ordered by type.
Freq - Frequency of Occurrence, the number of occurrences of an entity across all documents
Docs - Related Documents, the number of documents which contain the entity
By default the columns show the frequency and related documents of an entity across all documents in the case project. If a search for a
specific entity or entities related to a specific entity is active (see the Back Navigation drop down) the data in relation to the current search is
shown.
Right-click on an entity to show the Entity Context Menu. The context menu contains the following actions:
1. Enter the name of the entity or the start of the name in the Entity Search Field
2. Click Search or hit the enter key
The search field supports the use of wild card symbols to search for parts of names, for example “Franz*” searches for all entities which start
with “Franz”. The wildcard patterns support ? to replace a single character and * to replace one or multiple characters. By default the system
adds the * wildcard pattern to the end of a term to match all entities starting with this term. If you want to search for an exact term, enclose
the term in double quotes.
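The matching rules of the entity search field can be summarised in a short Python sketch. It is not the Suite's implementation: the function name entity_matches is hypothetical, matching is assumed to be case-sensitive, and the * wildcard is assumed to also match zero characters, as in Python's fnmatch module.

```python
from fnmatch import fnmatchcase


def entity_matches(search_term, entity_name):
    """Apply the search field rules to a single entity name."""
    term = search_term.strip()
    # A term enclosed in double quotes is treated as an exact search.
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        return entity_name == term[1:-1]
    # By default a * is appended so that the term matches as a prefix.
    if not term.endswith("*"):
        term += "*"
    return fnmatchcase(entity_name, term)


print(entity_matches("Franz", "Franz Kafka"))    # prefix search
print(entity_matches('"Franz"', "Franz Kafka"))  # exact search
```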
In the result tree either all entities or the entities matching the search term are shown. Expand the result tree to the next level to view the
documents where the different entities are found. Double-click on a document icon in the result tree to open the document in the Document
editor view.
Right-click on an entity, and click Show related entities to show all related entities in the Entity Browser view. In order to go back, select a
previous set of entities from the drop down menu which contains a list of previous actions.
Now the Entity Browser view is updated showing only the related entities to the previously selected one
Using the Graph view
The Graph view is automatically shown if you add an entity to it. It shows the co-occurrence relationship between entities in the case project.
Right-click on one or more selected entities in the Entity Browser view and click Add to Graph to add them to the Graph view.
Whenever an entity is added to the Graph view, the system tries to find existing co-occurrence relationships with the entities already shown
in the graph.
Note: The search for co-occurrences may block the user interface for a short while. This is a known issue and will be fixed in a future
version of the software.
Double-click on the title tab of the Graph view to maximize it. A click on the small triangle to the right of the view’s title tab reveals the view’s
context menu. To refresh the Graph view click on the triangle and then click Refresh.
Double-click on the edge connecting two entities in the graph to show a dialog with the documents establishing the relationship between the
two entities. From the connection dialog you can select one or many documents to open them in the editor area.
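The idea behind the co-occurrence graph can be sketched in Python as follows. The sketch assumes that two entities co-occur when they are found in the same document, and the function name, entity names and document names are hypothetical. Each edge keeps the list of documents that establish the relationship, similar to what the connection dialog shows.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(doc_entities):
    """Map each unordered entity pair to the sorted list of documents containing both."""
    edges = defaultdict(list)
    for doc in sorted(doc_entities):
        # Every pair of distinct entities in one document forms an edge.
        for a, b in combinations(sorted(set(doc_entities[doc])), 2):
            edges[(a, b)].append(doc)
    return dict(edges)


# Hypothetical extraction result: document -> entities found in it
doc_entities = {
    "report.txt": ["Alice", "Bob"],
    "news.txt": ["Alice", "Bob", "Carol"],
    "blog.txt": ["Carol"],
}
edges = cooccurrence_edges(doc_entities)
for pair, docs in sorted(edges.items()):
    print(pair, docs)
```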
We provide a virtual image with an Ubuntu 14.04 LTS installation so that the EMM OSINT Suite software can be run on Mac OS X.
Installing VirtualBox
Download VirtualBox from its web site https://www.virtualbox.org/
Choose "VirtualBox for OS X hosts"
Install VirtualBox from DMG image using installer
General
What is the EMM OSINT Suite?
Licensing
How can I license the EMM OSINT Suite?
Technical Questions
I work for a public authority or institution in one of the member states of the European Union
Please send us a signed license agreement. The agreement should be signed at least by a department manager (head of unit) or above and
can then be used for the complete department or authority.
Please send us a signed license agreement. The agreement should be signed at least by a head of unit (or above) and can then be used for
the complete unit (or institution).
Please contact [email protected]. We need to ask permission from our hierarchy. After your request has been reviewed we will
contact you explaining the next steps (e.g. license agreement, etc.).
To recover data from a corrupted workspace, please perform the following steps:
Copying Case Project data from the old workspace to the new workspace
While the EMM OSINT Suite instance is running with the new workspace, you can copy the following user data simply using the Windows
File Explorer or the command line on Linux from the old project directory to the new project directory:
Bookmark files
Crawler Configuration files
files from the Documents folder
modified and added files from a custom Configuration Project
You must not copy files from the .metadata or .osint directory while an instance of the application is running.
If you have modified the Name Variant Database (see Importing or updating the name variant database file) you can also copy the name
variant database file across:
Result Link Extraction works, but the download of the resulting Bookmarks
fails
The extraction of links from the result page of a search engine works, but I cannot download the resulting Bookmarks, what is the matter?
The software most likely failed to detect the correct proxy settings. The Browser View relies on an embedded system browser and therefore
uses the system-wide settings. However, the operating system does not provide proxy authentication information so the automatic detection
of proxy settings may have failed.
Please refer to Setting HTTP Proxy Information to find out how to detect and set the proxy information manually.
Currently we provide versions in 32- and 64-bit flavour for Microsoft Windows and Linux. If you run Windows XP please use the 32-bit
version.
If you run Windows 7 or later, you can do the following to check whether you already have a 64-bit system:
Note: The 32-bit version runs also on 64-bit systems. However, if supported by your system, prefer the 64-bit version.
The application startup fails due to "Companion shared library not found"
The application startup may fail with the error message "Companion shared library not found". A possible reason is that the companion
shared library (which is a DLL under Windows) may have been taken away by an anti-virus scanner running on your system.
Check if the companion shared library can still be found in your installation directory:
Open the installation directory (e.g. osint_2.2.3_win32.x86_64) in the Windows File Explorer
Open the sub directory plugins
Check if there is a sub directory starting with org.eclipse.equinox.launcher.win32.win32.x86_64<version-id>
Check if the sub directory contains a file eclipse_<build-id>.dll (the current build-id is 1503 but may change for newer releases)
If the shared library is not there, please check your anti-virus scanner if it has "quarantined" this DLL.
Running without the eclipse launcher and the companion shared library
If due to restrictions on your PC the companion shared library cannot be used, the software can also be run from the command line:
Open the Windows Command Prompt by selecting it from the Start Menu (usually it can be found under Accessories)
In the Command Prompt change to the installation directory of the application (for example if it is installed in the program folder)
cd "c:\Program Files\osint_2.2.3_win32.x86_64"
The information extraction modules require a lot of memory and CPU power. The faster the CPU is the better. System memory should not be
less than 4GB.
Mac OS X
We always test against the latest Apple desktop operating system (as of 01/2014 OS X Mavericks). The system should have a fast CPU and
at least 4GB of RAM. Please check the Release Notes for restrictions that may apply to OS X.
Linux
The software is tested to run on Ubuntu desktop 13.04. Both 32-bit and 64-bit versions are available. However, we strongly advise you to run
the 64-bit version. A fast CPU and at least 4 GB of RAM are required.
Glossary
Boolean Logic
Boolean logic is named after the mathematician George Boole. It is a form of algebra in which all values are reduced to TRUE or FALSE. We
can use the Boolean operators to form search queries and to filter results of a search more effectively.
Boolean Operators
There are three different operators: AND, OR and NOT.
AND Operator
The AND operator requires that all terms connected by the operator appear in the search results. For example, if someone searches for
Barack AND Michelle, only results in which both terms are present will appear. Google uses an implicit AND operator, which means that the search
query Barack Michelle equals Barack AND Michelle.
OR Operator
The OR operator is used to connect terms in a search query. The search engine lists pages containing at least one of the
connected terms. For example, Barack OR Michelle will find all pages where either Barack is mentioned or Michelle is mentioned or both
are mentioned. This example will yield a huge number of results, since "Michelle" is a pretty common name.
NOT Operator
The NOT operator is used to exclude pages with certain terms from a search result. For example Barack AND NOT Michelle will yield all
pages containing the name Barack but which do not contain Michelle. Note: Google uses a dash "-" for the NOT operator and has an implicit
AND. Therefore, Barack AND NOT Michelle will be written as Barack -Michelle in Google's query language.
Intelligence Cycle
Law enforcement authorities, security and intelligence services rely on some core processes to derive intelligence from input data. The
classical intelligence cycle forms a first framework to detect the consecutive stages from finding and acquiring raw data to deriving
intelligence in a determined way.
(Security Intelligence Cycle of the New Zealand Security Intelligence Service, http://www.security.govt.nz/our-work/our-methods/)
Identifying Threats - this is the initial step which starts the cycle
Setting Objectives - in this step the goal is to specify which questions need to be answered to be able to assess the threat
Collecting Information - this step covers the activity to harvest data from a wide variety of sources (OSINT - publicly available
sources)
Investigating and Analysing Information - this step contains automatic and manual information extraction and processing, e.g. find
the name of a specific person in a large collection of documents
Assessing and Reporting Information - this step describes how the gathered information is put into reports.
Reassessing Threats - with the gained knowledge, a threat is reassessed and depending on the outcome, the cycle may start again.
Although initially developed for intelligence services, the process framework can equally be used for classical law enforcement investigations.
Internet Protocols
At the heart of the Internet is a set of rules defining how computers can exchange messages. In computer science a set of communication rules
is commonly called a protocol. The messages are well-defined and each message has an exact meaning, intended to provoke a particular response
from the receiver.
The Internet Protocol Family consists of various protocols which describe different aspects of network communication. Commonly, these
protocols are put in a layered reference model to categorize the protocols. According to Andrew S. Tanenbaum (see [1], chapter 1.4.3) the
reference model for the Internet Protocol contains five layers:
Application - Contains programs that make use of the network, e.g. HTTP (Web), SMTP (Mail), RTP, DNS
In the following we will describe only the most important protocols, for an in-depth description of all protocols, see [1].
Every device attached to the Internet must have a unique address. In the currently dominant protocol of the Internet, Internet Protocol Version
4 (IPv4), addresses are made up of four numbers separated by periods (e.g. 192.168.2.1). An IP address is similar to a postal
address: it identifies a unique destination in the network. However, postal addresses are usually fixed, whereas an IP address may be
assigned to a device only for the time the device is connected to the network. Network components such as routers, which are permanently
connected to the network, usually have permanently assigned addresses (so-called static IP addresses).
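The dotted notation can be explored with a few lines of Python. This is only a minimal sketch using Python's standard ipaddress module; it is not part of EMM OSINT Suite, and the function name describe_ipv4 is chosen here for illustration.

```python
import ipaddress


def describe_ipv4(text):
    """Return the four numbers of an IPv4 address and its 32-bit integer value."""
    # IPv4Address raises ValueError if the text is not a valid IPv4 address.
    address = ipaddress.IPv4Address(text)
    numbers = [int(part) for part in text.split(".")]
    return numbers, int(address)


numbers, value = describe_ipv4("192.168.2.1")
print(numbers)  # the four numbers separated by periods
print(value)    # the same address as a single 32-bit number
```

The four numbers are simply a human-readable way of writing one 32-bit value, which is what the network software actually works with.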
IP Datagrams
A datagram (also commonly called a data packet) is a fundamental unit of data that is sent between a source and a destination in the
Internet. A datagram is made up of an IP header and the payload with the actual data (e.g. the content retrieved from a web site). The header
contains the source IP address, the destination IP address and other meta-data needed to deliver the packet.
Routing Protocol
Routing is the process of finding the destination computer (host) for a packet which is being forwarded in the network. Since the Internet is a
network of networks, there may exist multiple paths from a source IP address to a destination IP address (like for a postal package there are
many roads leading from one city to another). The task of the router is to forward a data packet towards its destination. For this purpose the
router uses routing tables to determine where a packet is going and how to send it.
In this example, PC A wants to send some data to PC B. The IP network software on PC A encapsulates all data in IP datagrams (IP
packets). These packets are then sent to router R2, which connects PC A to the Internet (here made up of three
subnetworks). The datagrams contain the source IP address of PC A, which lies inside the 20.0.0.0 network, and the target IP
address of PC B, which is 30.0.0.1:
Header
Source 20.0.0.1
Destination 30.0.0.1
Payload
Hello There!
Router R2, which connects PC A to the Internet, forwards all datagrams with an address other than 20.x.x.x to router R4. Now router R4
needs to decide where to forward the incoming datagram. To do so, router R4 consults its routing table. The routing table defines that
all datagrams with a destination address in the 30.0.0.0 network need to be forwarded via Port 2 to router R3. Router R4 forwards the datagram to router
R3. Router R3 knows all connected PCs and forwards the datagram to the final destination 30.0.0.1, which is PC B.
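The forwarding decision of router R4 can be sketched as a lookup in a small table. The following is an illustrative sketch only: the entry for 30.0.0.0 via Port 2 is taken from the example, while the /8 prefix lengths, the other two table entries and the name next_hop are assumptions made for this illustration.

```python
import ipaddress

# Hypothetical routing table for router R4. Only the 30.0.0.0 entry
# (Port 2, router R3) comes from the example in the text; the prefix
# lengths and the remaining entries are assumed for illustration.
ROUTING_TABLE_R4 = [
    (ipaddress.ip_network("30.0.0.0/8"), "Port 2 (router R3)"),
    (ipaddress.ip_network("20.0.0.0/8"), "Port 1 (router R2)"),
    (ipaddress.ip_network("0.0.0.0/0"), "Port 3 (default route)"),
]


def next_hop(destination, table=ROUTING_TABLE_R4):
    """Return the port of the matching entry with the longest prefix."""
    address = ipaddress.ip_address(destination)
    matches = [(network, port) for network, port in table if address in network]
    if not matches:
        raise LookupError(f"no route to {destination}")
    # The most specific network (longest prefix) wins.
    _, port = max(matches, key=lambda entry: entry[0].prefixlen)
    return port


datagram = {"source": "20.0.0.1", "destination": "30.0.0.1", "payload": "Hello There!"}
print(next_hop(datagram["destination"]))
```

Real routers build and update such tables with dedicated routing protocols; the sketch only shows the lookup step for a single datagram.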
Application Protocols
Application layer protocols describe how applications using the network can communicate with each other.
A web server program provides a web site made up of resources such as HTML documents and media files (images, movies, etc.) to
clients on the WWW. A web browser is a typical client which requests resources from the server. To do so, the web browser sends an
HTTP GET request to the web server. The web server responds either with the requested resource (for example an HTML document) or with a
status code describing an error condition. Resources on a server are identified using Uniform Resource Identifiers (URIs) or, more specifically
for HTTP, using Uniform Resource Locators (URLs).
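To make the request and response messages more concrete, the sketch below formats the text of a minimal HTTP/1.1 GET request for a URL and splits a response status line into its parts. It uses only Python's standard library and sends nothing over the network; the function names and the example URL are chosen here for illustration.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Format the text of a minimal HTTP/1.1 GET request for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {parts.netloc}",
        "Connection: close",
        "",  # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 404 Not Found' into its parts."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


print(build_get_request("http://www.example.com/index.html"))
print(parse_status_line("HTTP/1.1 404 Not Found"))
```

The status code in the first line of the response tells the client whether the resource follows (for example 200) or an error occurred (for example 404).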
The Simple Mail Transfer Protocol (SMTP) is the Internet standard protocol for transmitting electronic mail. Usually SMTP is used to exchange messages
between mail servers. Most email client programs use it only to send email, and use other protocols, such as the Post Office Protocol (POP)
or the Internet Message Access Protocol (IMAP), to retrieve email messages from a mail server.
The search engine uses the Index component, which is created during the web crawl operation.
1. A search engine user types a search query into the web form presented by the Search Engine Frontend (web server).
2. The Frontend component translates this query into internal retrieval commands. In doing so, it takes into account additional
knowledge about the user, such as the user's search history, as well as typical search queries used by a large number of users. The retrieval
commands are forwarded to the Query Engine component.
3. The Query Engine is responsible for retrieving results for the query from the Index. The Query Engine and the Index are massively
distributed systems, so many servers are queried in parallel to produce results.
4. The Index component returns results to the Query Engine.
5. The results are forwarded to a component providing a ranking algorithm. The results are ranked according to different criteria, and
a score is calculated for each search result. Typical static criteria include the number of occurrences of a search term in a page. These static
criteria are combined with dynamic criteria such as user-related information (search history, etc.).
6. The ranked results are forwarded to the Search Engine Frontend.
7. The Frontend component renders the search result page and returns it to the user. In most cases the result page is enriched with
data from other systems such as advertising servers.
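The interplay of Index, Query Engine and ranking can be illustrated with a toy example. This is a deliberately simplified sketch, not a description of how any real search engine is implemented: the index is a plain in-memory dictionary, the only ranking criterion is the static one mentioned above (number of occurrences of the search terms), and the page contents and function names are invented.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to {url: number of occurrences} (a tiny inverted index)."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, query):
    """Return the URLs containing all query terms, ranked by occurrence count."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    scores = {url: sum(posting[url] for posting in postings) for url in candidates}
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


PAGES = {
    "http://example.org/a": "fraud report about invoice fraud",
    "http://example.org/b": "invoice template",
    "http://example.org/c": "fraud invoice fraud invoice fraud",
}
index = build_index(PAGES)
print(search(index, "invoice fraud"))
```

In this sketch build_index plays the role of the crawl-time Index creation, and search combines the Query Engine lookup with the ranking step.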
For the purposes of law enforcement, web-based information is becoming more and more important in supporting investigations into fraud and
criminal activities.
References
[1] Wikipedia Entry Open Source Intelligence. Retrieved September 10, 2013, from http://en.wikipedia.org/wiki/Open-source_intelligence
Concepts
Entity Extraction
Reporting
Text Extraction
Category Matching
Category Matching is the process of classifying or grouping documents according to a field of interest (which we call a category). EMM
OSINT Suite allows you to classify documents on disk which have previously been downloaded or imported.
The software provides an editor to define categories. Each category definition consists of a combination of keywords. With the help of these
keyword combinations the software can categorize a set of documents.
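The idea of keyword combinations can be sketched in a few lines. The data structure below is purely hypothetical and does not reproduce the Category Definition File format of EMM OSINT Suite: it assumes that a category lists several keyword combinations and that a document matches when every keyword of at least one combination occurs in it.

```python
import re

# Hypothetical category definitions (names and keywords are invented).
# A document matches a category when all keywords of at least one
# combination occur in its text.
CATEGORIES = {
    "vat fraud": [{"vat", "fraud"}, {"carousel", "invoice"}],
    "smuggling": [{"container", "customs"}],
}


def categorize(text, categories=CATEGORIES):
    """Return the sorted names of all categories matched by the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(
        name
        for name, combinations in categories.items()
        if any(combination <= words for combination in combinations)
    )


print(categorize("Customs officers opened the container after a VAT fraud tip."))
```

A document can fall into several categories at once, which is why the function returns a list rather than a single name.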
Entity Extraction
The goal of Entity Extraction is to find locations in the text which contain entity information. In other words, it tries
to find occurrences of person names, locations, VAT numbers, etc.
The overall process is split into sub-modules which run in a pipeline-like fashion.
The system performs the following extraction steps:
Name Variant Matching: Matches name variants from the Name Variant Database.
Regular Expression Matching: Matches built-in types such as VAT number, email address, URL, IP address, credit card number, date,
phone number, zip code and personal ID, as well as custom user-defined types based on regular expressions.
Entity Normalisation: Combines similar name variants into a single entity profile and provides unique ids for entities across the
document set.
The Name Variant Database contains entities of various types (e.g. Person, Organisation, etc.). It is amended each time the Entity
Normalisation process finds a new entity.
Note: The initial Name Variant Database is created automatically from the EMM NewsBrief system. Therefore, the quality of the entries may
vary.
Each possible spelling of an entity is called a Name Variant (or, for short, a “variant”). Since a person entity can have many different spellings of
its name, the variants are clustered in a so-called Name Variant Profile (or, for short, a “profile”). The name of a profile is taken from one of its
variants. We call this variant the canonical variant.
For example, the profile for “Franz Beckenbauer” (a former German soccer player) contains a variety of variants which can also contain
misspellings of his name.
The profile is named “Franz Beckenbauer” after the canonical variant. Each profile has a unique id in the system. Therefore, all variants
found belonging to the same profile will get the same profile id (and represent the same Entity in the system). The Name Variant Database is
automatically amended by the entity normalisation module which tries to find variants belonging to the same profile. In addition, the profiles
and variants in the database can be edited manually.
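The relationship between variants, profiles and profile ids can be modelled as follows. This is a toy model for illustration, not the actual storage format or API of the Name Variant Database; the class name, its methods and the two misspelled variants are invented.

```python
class NameVariantDatabase:
    """Toy model: variants are grouped into profiles with unique ids."""

    def __init__(self):
        self._profiles = {}        # profile id -> canonical variant
        self._variant_to_id = {}   # normalised variant -> profile id
        self._next_id = 1

    @staticmethod
    def _key(variant):
        # Compare variants case-insensitively and ignore extra whitespace.
        return " ".join(variant.lower().split())

    def add_profile(self, canonical, variants=()):
        """Create a profile named after its canonical variant."""
        profile_id = self._next_id
        self._next_id += 1
        self._profiles[profile_id] = canonical
        for variant in (canonical, *variants):
            self._variant_to_id[self._key(variant)] = profile_id
        return profile_id

    def lookup(self, variant):
        """Return (profile id, canonical variant), or None if unknown."""
        profile_id = self._variant_to_id.get(self._key(variant))
        if profile_id is None:
            return None
        return profile_id, self._profiles[profile_id]


db = NameVariantDatabase()
# The variants below are invented for illustration only.
db.add_profile("Franz Beckenbauer", ["Franz Beckenbaur", "F. Beckenbauer"])
print(db.lookup("franz  beckenbaur"))
```

Every variant resolves to the same profile id, so all matches in the document set end up representing one entity.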
Geo Matching
The Geo Matching module matches the text against a database of location names (Countries, Regions, Cities, etc.).
Regular Expression Matching
The Regular Expression Matching module matches built-in entity types and user-defined custom entity types. The matching is based on
Regular Expressions.
Entity Guessing
Entity Normalisation
Regular Expressions
A Regular Expression (often abbreviated regex or regexp) is a sequence of characters which forms a search pattern used to match strings.
Each character in a regular expression is either a meta-character with a special meaning or a regular character with its literal meaning.
.*linkedin\.com.*
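The pattern above can be tried out with Python's re module. Note that this is an assumption made for demonstration purposes: Python's regular expression dialect is not identical to the syntax used inside EMM OSINT Suite, and the function name and example URLs are invented.

```python
import re

# "." is a meta-character matching any character, "*" repeats the
# preceding element zero or more times, and "\." is a literal dot.
LINKEDIN = re.compile(r".*linkedin\.com.*")


def is_linkedin_url(url):
    """Return True if the whole string matches the pattern."""
    return LINKEDIN.fullmatch(url) is not None


print(is_linkedin_url("https://www.linkedin.com/in/example"))
print(is_linkedin_url("https://www.linkedinXcom.example.org/"))
```

The second URL is rejected because the escaped dot only matches a literal "." between "linkedin" and "com".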
Resources
Reporting
A report is a way to export the analysed data (entities and relationships) of a case project. Using different templates, output files in different
formats (Text, HTML, CSV, XML) can be created. The reporting mechanism is made up of three components:
1. Templates
2. Data Objects
3. Scripts
The templates used to create a report are stored internally and can be modified by creating a configuration project.
Data Objects
Project Object
The project object is the main entry point to access the data (documents, entities and relationships) of a project.
Note: The algorithm to calculate the EntityEntityRelations list is currently quite slow and takes a considerable amount of
time for large data sets. This is a known issue and will be improved in a future version.
Entity Object
The entity object represents a single entity found in a document of the project.
Document Object
DocumentEntityRelation Object
The documentEntityRelation object represents the relation between a document and an entity.
TextPosition Object
The textPosition object represents a position in a document where an entity was detected (see $documentEntityRelation.TextPositions).
Text Extraction
Text Extraction is the process of extracting raw text from multiple input file formats. The Text Extraction module of EMM OSINT Suite is based
on the open source project Apache Tika.
In addition to extracting the text, the module identifies the language of the text and stores it as meta data.
Training
Welcome to the EMM-OSINT Suite Training Sessions!
The EMM Open Source Intelligence Suite is a desktop software application which helps to find, acquire and analyse data from the Internet
and local sources. It provides automatic means to gather intelligence from openly available sources, removing the need to search manually
through vast data sets. EMM OSINT Suite comprises a set of powerful tools to support the main processes of intelligence gathering from
open sources. Documents can be acquired from the public internet as well as from local sources. The core of the software is the entity
extraction module, which matches text locations against pre-defined patterns for different types of entities, such as person, organisation and
place names, credit card numbers, VAT identifiers, URLs, etc. User-defined patterns can be added to find investigation-specific entity types,
such as number plates or tax identifiers. The analysis views help to make sense of the data.
Here you will find material for the training sessions introducing different modules of the software:
Microsoft Windows
Linux (Ubuntu)
MacOSX
Linux (Ubuntu)
Installing EMM-OSINT Suite on Linux (Ubuntu)
Installing on Linux.
Microsoft Windows
Installing EMM-OSINT Suite on Microsoft Windows
Installing on Windows.
MacOSX
Installing EMM-OSINT Suite on MacOSX
2. Getting Started
Quick Start Guide.
3. Module Sessions
C1 - Data Acquisition
C1 - Lab Exercises
C2 - Entity Extraction
C2 - Lab Exercises
C3 - Entity Extraction Advanced
C3 - Lab Exercises
C4 - Reporting & Data Export
C4 - Lab Exercises
C5 - Category Matching
C1 - Data Acquisition
EMM-OSINT Suite provides a browser based search interface to the internet search engines. Search results can be downloaded for further
local processing. In addition to using search engines, targeted websites can be crawled using the embedded web crawler. The crawler
follows the link structure of a website and downloads relevant pages to local disk. A file import wizard complements the acquisition tools. It
allows importing locally stored documents for further analysis. For further processing the plain text is extracted from a variety of document
formats, such as HTML, PDF and Microsoft Office.
Before starting, a Case Project must be created as a prerequisite (see Creating a Case Project).
Performing a Search
The search module allows you to gather result links from internet search engines as bookmarks.
Managing Bookmarks
The crawler allows you to collect data from specific web sites by following the site structure.
Data can be imported from local disk into a Case Project. One way is to import documents (plain text files, PDF files, Microsoft Office files)
for later analysis. Another way is to import bookmarks from web browsers:
Search on Google
Search on Bing
C2 - Entity Extraction
The Entity Extraction finds locations in the text which contain "entity information", such as person names, locations, organizations, VAT
numbers, etc.
The entity extraction searches a set of documents located in the Documents folder of a Case Project. The Documents folder is a special
predefined folder which contains all input documents for the entity extraction.
Once the Entity Extraction has finished, you can review the entities which were found.
The Name Variant Database is used to match variants of an entity (such as Barack Obama, President Obama, etc.).
C2 - Lab Exercises
Open the Entity Browser and look for entities with similar names
Use the editor to consolidate two entity variants into a single profile
Create a regular expression pattern to extract container numbers (BIC code) from a set of documents.
You can use http://www.regexr.com/ to develop a pattern according to the Wikipedia definition of container numbers.
[A-Z]{3}[UJZR][0-9]{7}
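Before putting the pattern into a custom entity file, it can be tested outside the application. The sketch below is an illustration in Python, not part of the exercise solution: it applies the pattern above and additionally checks the final digit with the check-digit rule commonly described for ISO 6346 container numbers (letters valued from 10 upwards skipping multiples of 11, weights doubling per position, sum modulo 11). The check-digit step, all function names and the sample text are additions made here; a similar check could serve as a validation script for the custom entity.

```python
import re
import string

CONTAINER_PATTERN = re.compile(r"[A-Z]{3}[UJZR][0-9]{7}")


def _letter_values():
    """Letter values for the check digit: start at 10, skip multiples of 11."""
    values, value = {}, 10
    for letter in string.ascii_uppercase:
        if value % 11 == 0:
            value += 1
        values[letter] = value
        value += 1
    return values


LETTER_VALUES = _letter_values()


def check_digit_ok(code):
    """Compare the last digit with the check digit of the first ten characters."""
    total = 0
    for position, char in enumerate(code[:10]):
        value = LETTER_VALUES[char] if char.isalpha() else int(char)
        total += value * 2 ** position
    return total % 11 % 10 == int(code[10])


def find_container_numbers(text):
    """Return all pattern matches in the text that also pass the check digit."""
    return [match for match in CONTAINER_PATTERN.findall(text) if check_digit_ok(match)]


print(find_container_numbers("Seen: CSQU3054383 and CSQU3054384, not ABCD1234567"))
```

The pattern alone accepts any string of the right shape; the extra check filters out matches such as mistyped numbers that merely look like container numbers.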
Now create a custom entity pattern file named bic.xml in <config-project>/Custom Entities/Active Entities:
bic.xml
<?xml version="1.0" ?>
<expressions>
<!--
OPTIONAL: the declaration tag can be empty
This declaration section defines additional scripts used for
validation. The scripts are written in Groovy, a scripting language
which is a superset of Java. See http://groovy.codehaus.org/ for more information.
-->
<declaration><![CDATA[
/**
* This predefined init method is called only once, during initialisation of the
* custom entity module. It should be used to load resources for validation.
*/
public void init() {
//this function can be used to load resources from
//the "Custom Entities/Resources" directory in the active configuration project.
//The path of this directory can be accessed from the predefined resourcespath variable.
//The loaded resources should be stored to the context variable in order to be accessible
//from other scripts which are called for each text match.
}
/**
* This is an example of a validation function, which can be accessed as
* global.validate(term) to further validate a matched pattern.
*/
public boolean validate(String term) {
return true;
}
]]>
</declaration>
<!--
MANDATORY: the type tag must define a two letter id and a description
of the custom pattern.
-->
<type id="bc" description="container bic"/>
<!--
MANDATORY: At least one expression definition must exist.
-->
<expression>
<!--
MANDATORY: regex contains the regular expression to match the text.
By default it uses the restricted "Brics syntax" for performance reasons.
See http://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/RegExp.html
for more information.
Create a custom entity type with patterns which match the ISIN number of at least three European countries
See Editing the current Name Variant Database in order to modify or add new entities to the current Name Variant Database in OSINT.
Create a Name Variant file with an entity profile and a number of variants for the person name
Import the Name Variant file into the Name Variant Database
C3.2 Export the Name Variant Database and Import an improved one
C3.3 (Advanced) Creating a Custom Entity Type to match a list of company names
Create a custom entity type which loads a list of person names from a text file.
C4 - Reporting & Data Export
Bookmarks are files in the workspace containing a web URL and some meta data. The meta data contains the following data:
Key Description
Search Engine search engine the URL was extracted from (e.g. Google)
Netscape bookmarks file: An old HTML-based file format to export a hierarchy of bookmarks.
This format can be read by Firefox, Microsoft Internet Explorer and Google Chrome.
No meta data is exported, only the URL and the title.
Tab separated value file: A plain text file containing all bookmarks and meta data in columns
with tabulator
Documents can be exported to a proprietary XML format. Either all documents are exported into multiple XML documents or a single large
XML document can be created.
Tag Description
C4.4 Reporting
A report is a way to export the analysed data (entities and relationships) of a case project. Using different templates, output files in different
formats (Text, HTML, CSV, XML) can be created. The reporting mechanism is made up of three components:
1. Templates
2. Data Objects
3. Scripts
The templates used to create a report are stored internally and can be modified by creating a configuration project.
The system ships with a set of predefined reports, which are accessible from the Reports view. Please refer to Generating a Report.
Export bookmarks from your project and import the Netscape Bookmark file into the browser of your choice