a screenshot as a query. Similarly, to automate interactions with a GUI element, a programmer can insert the element's screenshot directly into a script statement and specify what keyboard or mouse actions to invoke when this element is seen on the screen. Compared to the non-visual alternatives, taking screenshots is an intuitive way to specify a variety of GUI elements. Also, screenshots are universally accessible for all applications on all GUI platforms, since it is always possible to take a screenshot of a GUI element.

We make the following contributions in this paper:
• Sikuli Search, a system that enables users to search a large collection of online documentation about GUI elements using screenshots;
• an empirical demonstration of the system's ability to retrieve relevant information about a wide variety of dialog boxes, plus a user study showing that screenshots are faster than keywords for formulating queries about GUI elements;
• Sikuli Script, a scripting system that enables programmers to use screenshots of GUI elements to control them programmatically. The system incorporates a full-featured scripting language (Python) and an editor interface specifically designed for writing screenshot-based automation scripts;
• two examples of how screenshot-based interaction techniques can improve other innovative interactive help systems (Stencils [8] and Graphstract [7]).

This paper is divided into two parts. First we describe and evaluate Sikuli Search. Then we describe Sikuli Script and present several example scripts. Finally we review related work, discuss limitations of our approach, and conclude.

SCREENSHOTS FOR SEARCH
This section presents Sikuli Search, a system for searching GUI documentation by screenshots. We describe motivation, system architecture, prototype implementation, the user study, and performance evaluation.

Motivation
The development of our screenshot search system is motivated by the lack of an efficient and intuitive mechanism to search for documentation about a GUI element, such as a toolbar button, icon, dialog box, or error message. The ability to search for documentation about an arbitrary GUI element is crucial when users have trouble interacting with the element and the application's built-in help features are inadequate. Users may want to search not only the official documentation, but also computer books, blogs, forums, or online tutorials to find more help about the element. Current approaches require users to enter keywords for the GUI elements in order to find information about them, but suitable keywords may not be immediately obvious. Instead, we propose to use a screenshot of the element as a query. Given their graphical nature, GUI elements can be most directly represented by screenshots. In addition, screenshots are accessible across all applications and platforms by all users, in contrast to other mechanisms, like tooltips and help hotkeys (F1), that may or may not be implemented by the application.

System Architecture
Our screenshot search system, Sikuli Search, consists of three components: a screenshot search engine, a user interface for querying the search engine, and a user interface for adding screenshots with custom annotations to the index.

Screenshot Search Engine
Our prototype system indexes screenshots extracted from a wide variety of resources such as online tutorials, official documentation, and computer books. The system represents each screenshot using three different types of features (Figure 2). First, we use the text surrounding it in the source document, which is a typical approach taken by current keyword-based image search engines.

Second, we use visual features. Recent advances in computer vision have demonstrated the effectiveness of representing an image as a set of visual words [18]. A visual word is a vector of values computed to describe the visual properties of a small patch in an image. Patches are typically sampled from salient image locations, such as corners, that can be reliably detected despite variations in scale, translation, brightness, and rotation. We use the SIFT feature descriptor [11] to compute visual words from salient elliptical patches (Figure 2.3) detected by the MSER detector [12].

Screenshot images represented as visual words can be indexed and searched efficiently using an inverted index that contains an entry for each distinct visual word. To index an image, we extract visual words and for each word add the image ID to the corresponding entry. To query with another image, we also extract visual words and for each word retrieve from the corresponding entry the IDs of the images previously indexed under this word. Then, we find the IDs retrieved the most times and return the corresponding images as the top matches.
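To make this scheme concrete, the following minimal sketch (ours, for illustration; the prototype's indexer is written in C++) shows an inverted index with vote counting, assuming each image has already been quantized into a list of integer visual-word IDs:

    from collections import Counter, defaultdict

    class InvertedIndex:
        """Minimal visual-word inverted index: word ID -> list of image IDs."""
        def __init__(self):
            self.postings = defaultdict(list)

        def add(self, image_id, words):
            # Index an image: record its ID under every visual word it contains.
            for w in words:
                self.postings[w].append(image_id)

        def query(self, words, top_k=5):
            # Count how often each indexed image is retrieved across all query
            # words; the most frequently retrieved images are the top matches.
            votes = Counter()
            for w in words:
                votes.update(self.postings.get(w, []))
            return [image_id for image_id, _ in votes.most_common(top_k)]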
Third, since GUI elements often contain text, we can index their screenshots based on embedded text extracted by optical character recognition (OCR). To improve robustness to OCR errors, instead of using the raw strings extracted by OCR, we compute 3-grams from the characters in these strings. For example, the word system might be incorrectly recognized as systen. But when represented as sets of 3-grams over characters, these two terms are {sys, yst, ste, tem} and {sys, yst, ste, ten} respectively, which results in a 75% match rather than a complete mismatch. We consider only letters, numbers, and common punctuation, which together define a space of about 50,000 unique 3-grams. We treat each unique 3-gram as a visual word and include it in the same index structure used for visual features.
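As a quick illustration, the 3-gram representation and the resulting partial match can be computed as follows (a sketch; char_3grams is our hypothetical helper, not part of the system):

    def char_3grams(s):
        """Return the set of character 3-grams of a string."""
        return {s[i:i+3] for i in range(len(s) - 2)}

    a = char_3grams("system")    # {'sys', 'yst', 'ste', 'tem'}
    b = char_3grams("systen")    # {'sys', 'yst', 'ste', 'ten'}
    match = len(a & b) / len(a)  # 3 shared 3-grams out of 4 -> 0.75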
User Interface for Searching Screenshots
Sikuli Search allows a user to select a region of interest on the screen, submit the image in the region as a query to the search engine, and browse the search results. To specify the region of interest, a user presses a hot-key to switch to Sikuli Search mode and begins to drag out a rubber-band rectangle around it (Figure 1). Users do not need to fit the rectangle perfectly around a GUI element, since our screenshot representation scheme allows inexact matching. After the rectangle is drawn, a search button appears next to it, which submits the image in the rectangle as a query to the search engine and opens a web browser to display the results.

User Interface for Annotating Screenshots
We have also explored using screenshots as hooks for annotation. Annotation systems are common on the web (e.g., WebNotes, http://www.webnotes.com/, and Shiftspace, http://shiftspace.org/), where URLs and HTML page structure provide robust attachment points, but similar systems for the desktop have previously required application support (e.g., Stencils [8]). Using screenshots as queries, we can provide general-purpose GUI element annotation for the desktop, which may be useful in both personal and community contexts. For example, consider a dialog box for opening a remote desktop connection. A user may want to attach a personal note listing the IP addresses of the remote machines accessible by the user, whereas a community expert may want to create a tutorial document and link the document to this dialog box.

Sikuli Search's annotation interface allows a user to save screenshots with custom annotations that can be looked up using screenshots. To save a screenshot of a GUI element, the user draws a rectangle around it to capture its screenshot to save in the visual index. The user then enters the annotation to be linked to the screenshot. Optionally, the user can mark a specific part of the GUI element (e.g., a button in a dialog box) to which the annotation is directed.

Prototype Implementation
The Sikuli Search prototype has a database of 102 popular computer books covering various operating systems (e.g., Windows XP, MacOS) and applications (e.g., Photoshop, Office), all represented in PDF (http://www.pdfchm.com/). This database contains more than 50k screenshots. The three-feature indexing scheme is written in C++ to index these screenshots, using SIFT [11] to extract visual features, Tesseract (http://code.google.com/p/tesseract-ocr/) for OCR, and Ferret (http://ferret.davebalmain.com/) for indexing the text surrounding the screenshots. All other server-side functionality, such as accepting queries and formatting search results, is implemented in Ruby on Rails (http://rubyonrails.org/) with a SQL database. On the client side, the interfaces for searching and annotating screenshots are implemented in Java.

User Study
We have argued that a screenshot search system can simplify query formulation without sacrificing the quality of the results. To support these claims, we carried out a user study to test two hypotheses: (1) screenshot queries are faster to specify than keyword queries, and (2) results of screenshot and keyword search have roughly the same relevance as judged by users. We also used a questionnaire to shed light on users' subjective experiences of both search methods.

Method
The study was a within-subject design and took place online. Subjects were recruited from Craigslist and compensated with $10 gift certificates. Each subject was asked to perform two sets of five search tasks (1 practice + 4 actual tasks). Each set of tasks corresponds to one of the two conditions (i.e., image or keyword), which were randomly ordered. The details of a task are as follows.
First, the subject was presented an image of the whole desktop with an arbitrarily-positioned dialog box window. Each dialog box was randomly drawn without replacement from the same pool of 10 dialog boxes. This pool was created by randomly choosing from those in our database known to have relevant matches. Next, the subject was told to specify queries by entering keywords or by selecting a screen region, depending on the condition. The elapsed time between the first input event (keypress or mouse press) and the submit action was recorded as the query formulation time. Finally, the top 5 matches were shown and the subject was asked to examine each match and to indicate whether it seemed relevant or irrelevant (Figure 3).

Figure 3: User study task, presenting a desktop image containing a dialog box (left) from which to formulate a query, and search results (right) to judge for relevance to the dialog box.

After completing all the tasks, the subject was directed to an online questionnaire to rate subjective experiences with the two methods on a 7-point Likert scale (7: most positive). The questions were adapted from the evaluation of the Assieme search interface [6] and are listed below:
1. What is your overall impression of the system?
2. How relevant are the results?
3. Does the presentation provide a good overview of the results?
4. Does the presentation help you judge the relevance?
5. Does the input method make it easy to specify your query?
6. Is the input method familiar?
7. Is the input method easy to learn?

Results
Twelve subjects, six males and six females, from diverse backgrounds (e.g., student, waiter, retiree, financial consultant) and a wide age range (21 to 66, mean = 33.6, sd = 12.7), participated in our study and filled out the questionnaire. All but one were native English speakers.

The findings supported both hypotheses. The average query formulation time was less than half as long for screenshot queries (4.02 sec, s.e.=1.07) as for keyword queries (8.58 sec, s.e.=.78), a statistically significant difference, t(11) = 3.87, p = 0.003. The number of results rated relevant (out of 5) averaged 2.62 (s.e.=.26) for screenshot queries and 2.87 (s.e.=.26) for keyword queries, which was not significant, t(11) = .76, p = .46.

The responses to the questionnaire for subjective rating of the two query methods are summarized in Figure 4. The most dramatic difference was familiarity (Q6), t(11) = 4.33, p < .001. Most subjects found keyword queries more familiar than screenshot queries. There was a trend for subjects to report screenshot queries as easier to use (Q5) and easier to learn (Q7) than keyword queries, p < .1.

Figure 4: Mean and standard error of the subjective ratings of the two query methods.

We observed that several subjects improved their speed in making screenshot queries over several tasks, which may suggest that while they were initially unfamiliar with this method, they were able to learn it rather quickly.

Performance Evaluation
We evaluated the technical performance of the Sikuli Search prototype, which employs the three-feature indexing scheme (surrounding text, embedded text, and visual features), and compared it to that of a baseline system using only traditional keyword search over surrounding text. The evaluation used a set of 500 dialog box screenshots from a tutorial website for Windows XP (http://www.leeindy.com/), which was not part of the corpus used to create the database. For the keyword search baseline, we manually generated search terms for each dialog box using words in the title bar, heading, tab, and/or the first sentence of the instruction, removing stop words, and capping the number of search terms at 10.

We measured coverage, recall, and precision. Coverage measures the likelihood that our database contains a relevant document for an arbitrary dialog box. Since the exact measurement of coverage is difficult given the size of our database, we examined the top 10 matches of both methods and obtained an estimate of 70.5% (i.e., 361/500 dialogs had at least one relevant document). To estimate precision and recall, we obtained a ground-truth sample by taking the union of all the correct matches given by both methods for the queries under coverage (since recall is undefined for queries outside the coverage).

Figure 5: Retrieval performance of search prototype.

Figure 5 shows the precision/recall curves of the two methods. As can be seen, the screenshot method achieved the best results. We speculate that the keyword baseline performed poorly because it relies only on the text surrounding a screenshot, which does not necessarily correlate with the text actually embedded in the screenshot. The surrounding text often provides additional information rather than repeating what is already visible in the screenshot. However, users often choose keywords based on the visible text in the dialog box, and these keywords are less likely to retrieve documents containing the screenshots of the right dialog box.
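For reference, the precision and recall values behind such curves can be computed per query from the pooled ground truth; the following is a minimal sketch with hypothetical inputs, not our actual evaluation code:

    def precision_recall_at_k(ranked_ids, relevant_ids, k):
        """Precision and recall of the top-k ranked results, given a pooled
        ground-truth set of relevant document IDs for this query."""
        hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
        precision = hits / k
        recall = hits / len(relevant_ids) if relevant_ids else 0.0
        return precision, recall

    # Sweeping k traces one query's precision/recall curve:
    # points = [precision_recall_at_k(results, truth, k) for k in range(1, 11)]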
SCREENSHOTS FOR AUTOMATION
This section presents Sikuli Script, a visual approach to UI automation by screenshots. We describe motivation, algorithms for matching screenshot patterns, our visual scripting API, an editor for composing visual scripts, and several example scripts.

Motivation
The development of our visual scripting API for UI automation is motivated by the desire to address the limitations of current automation approaches. Current approaches tend to require support from application developers (e.g., AppleScript and Windows Scripting, which require applications to provide APIs) or accessible text labels for GUI elements (e.g., DocWizards [2], Chickenfoot [3], and CoScripter [10]). Some macro recorders (e.g., Jitbit, http://www.jitbit.com/macrorecorder.aspx, and QuicKeys, http://www.startly.com/products/qkx.html) achieve cross-application and cross-platform operability by capturing and replaying low-level mouse and keyboard events on a GUI element based on its absolute position on the desktop or its position relative to the corner of its containing window. However, these positions may become invalid if the window is moved or if the elements in the window are rearranged due to resizing.

Therefore, we propose to use screenshots of GUI elements directly in an automation script to programmatically control the elements with low-level keyboard and mouse input. Since screenshots are universally accessible across different applications and platforms, this approach is not limited to a specific application. Furthermore, the GUI element a programmer wishes to control can be dynamically located on the screen by its visual appearance, which eliminates the movement problem suffered by existing approaches.

Finding GUI Patterns on the Screen
At the core of our visual automation approach is an efficient and reliable method for finding a target pattern on the screen. We adopt a hybrid method that uses template matching for finding small patterns and invariant feature voting for finding large patterns (Figure 6).

Figure 6: Examples of finding small patterns of varying sizes (a) and colors (b) by template matching, and large patterns of varying sizes and orientations (c) by invariant feature voting.

If the target pattern is small, like an icon or button, template matching based on normalized cross-correlation [4] can be done efficiently and produces accurate results. Template matching can also be applied at multiple scales to find resized versions of the target pattern (to handle possible changes in screen resolution), or in grayscale to find patterns that are texturally similar but have different color palettes (to handle custom color themes).
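The small-pattern branch corresponds closely to standard template matching as implemented in OpenCV. The following is a minimal sketch (ours, not Sikuli's C++ implementation; the 0.9 threshold is illustrative), assuming the screen and target have been captured as images:

    import cv2

    def find_template(screen, target, threshold=0.9):
        """Locate a small target pattern on the screen image using
        normalized cross-correlation template matching."""
        scores = cv2.matchTemplate(screen, target, cv2.TM_CCORR_NORMED)
        _, best, _, (x, y) = cv2.minMaxLoc(scores)
        if best < threshold:
            return None              # no sufficiently similar region
        h, w = target.shape[:2]
        return (x, y, w, h)          # top-left corner and size of the match

Applying the same call to resized copies of the target (cv2.resize) covers scale changes, and converting both images to grayscale first (cv2.cvtColor) gives the color-insensitive variant described above.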
However, if the target pattern is large, like a window or dialog box, template matching might become too slow for interactive applications, especially if we allow variations in scale. In this case, we can consider an algorithm based on invariant local features such as SIFT [11], which have been used to solve various computer vision problems successfully over the past few years. The particular algorithm we have adapted for our purpose [13] was originally used for detecting cars and pedestrians in a street scene. This algorithm learns from a set of invariant local features extracted from a training pattern (a screenshot of a GUI element) and derives an object model that is invariant to scale and rotation. Encoded in the object model is the location of its center relative to each local feature. To detect this object model in a test image (the screen), we first extract invariant features from the image. For each feature, we can look up the corresponding feature in the object model and infer where the object center would be if this feature actually constituted a part of the object. If a cluster of features consistently points at the same object center, it is likely that these features actually form an object. Such clusters can be identified efficiently by voting on a grid, where each feature casts a vote on the grid location closest to the inferred object center. We identify the grid locations with the most votes and obtain a set of hypotheses. Each hypothesis can be verified for its geometric layout by checking whether a transformation can be computed between this hypothesis and the training pattern based on the set of feature correspondences between the two. The result is a set of matches and their positions, scales, and orientations relative to the target pattern. Note that while rotational invariance may be unnecessary in traditional 2D desktop GUIs, it can potentially benefit next-generation GUIs such as tabletop GUIs where elements are oriented according to users' viewing angles.
represent the target pattern and matching screen regions,
used to solve various computer vision problems successful-
respectively. A set of action commands invoke mouse and
ly over the past few years. The particular algorithm we
keyboard actions on screen regions. Finally, the visual dic-
have adapted for our purpose [13] was originally used for
tionary data type stores key-values pairs using images as
keys. We describe these components in more detail below.
9
http://www.jitbit.com/macrorecorder.aspx
10
http://www.startly.com/products/qkx.html
Find
The find() function locates a particular GUI element to interact with. It takes a visual pattern that specifies the element's appearance, searches the whole screen or part of the screen, and returns regions matching this pattern, or false if no such region can be found. For example, find( ) returns regions containing a Word document icon.

Pattern
The Pattern class is an abstraction for visual patterns. A pattern object can be created from an image or a string of text. When created from an image, the computer vision algorithm described earlier is used to find matching screen regions. When created from a string, OCR is used to find screen regions matching the text of the string.

An image-based pattern object has four methods for tuning how general or specific the desired matches must be:
• exact(): Require matches to be identical to the given search pattern pixel-by-pixel.
• similar(float similarity): Allow matches that are somewhat different from the given pattern. A similarity threshold between 0 and 1 specifies how similar the matching regions must be (1.0 = exact).
• anyColor(): Allow matches with different colors than the given pattern.
• anySize(): Allow matches of a different size than the given pattern.

Each method produces a new pattern, so they can be chained together. For example,

Pattern( ).similar(0.8).anyColor().anySize()

matches screen regions that are 80% similar to the given image, of any size and of any color composition. Note that these pattern methods can impact the computational cost of the search; the more general the pattern, the longer it takes to find it.
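Because each modifier returns a new pattern rather than mutating the receiver, Pattern behaves like an immutable builder. A minimal sketch of this design (our illustration, not Sikuli's source):

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class Pattern:
        """Immutable search pattern; every tuning method returns a copy,
        which is what makes chains like p.similar(0.8).anySize() safe."""
        image: str                  # path to the target screenshot
        similarity: float = 1.0     # 1.0 = exact, pixel-by-pixel
        any_color: bool = False
        any_size: bool = False

        def exact(self):
            return replace(self, similarity=1.0)

        def similar(self, similarity):
            return replace(self, similarity=similarity)

        def anyColor(self):
            return replace(self, any_color=True)

        def anySize(self):
            return replace(self, any_size=True)

    # p = Pattern("word_icon.png").similar(0.8).anyColor().anySize()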
Region
The Region class provides an abstraction for the screen region(s) returned by the find() function matching a given visual pattern. Its attributes are x and y coordinates, height, width, and similarity score. Typically, a Region object represents the top match; for example, r = find( ) finds the region most similar to the given image and assigns it to the variable r. When used in conjunction with an iterative statement, a Region object represents an array of matches. For example, for r in find( ) iterates through an array of matching regions, and the programmer can specify what operations to perform on each region represented by r.

Another use of a Region object is to constrain the search to a particular region instead of the entire screen. For example, find( ).find( ) constrains the search space of the second find() for the ok button to only the region occupied by the dialog box returned by the first find().

To support other types of constrained search, our visual scripting API provides a versatile set of constraint operators: left, right, above, below, nearby, inside, outside in 2D screen space, and after, before in reading order (e.g., top-down, left-right for Western reading order). These operators can be used in combination to express a rich set of search semantics, for example,

find( ).inside().find( ).right().find( ).

Action
The action commands specify what keyboard and/or mouse events should be issued to the center of a region found by find(). The commands currently supported in our API are:
• click(Region), doubleClick(Region): These two commands issue mouse-click events to the center of a target region. For example, click( ) performs a single click on the first close button found on the screen. Modifier keys such as Ctrl and Command can be passed as a second argument.
• dragDrop(Region target, Region destination): This command drags the element in the center of a target region and drops it in the center of a destination region. For example, dragDrop( , ) drags a Word icon and drops it in the recycle bin.
• type(Region target, String text): This command enters a given text in a target region by sending keystrokes to its center. For example, type( , "UIST") types "UIST" in the Google search box.

Visual Dictionary
A visual dictionary is a data type for storing key-value pairs using images as keys. It provides Sikuli Script with a convenient programming interface to access Sikuli Search's core functionality. Using a visual dictionary, a user can easily automate the tasks of saving and retrieving data based on images. The syntax of the visual dictionary is modeled after that of the built-in Python dictionary. For example, d = VisualDict({ : "word", : "powerpoint"}) creates a visual dictionary associating two application names with their icon images. Then, d[ ] retrieves the string powerpoint and d[ ] = "excel" stores the string excel under the given image. Because an image that was never stored is not a key, in d returns false for it and d[ ] raises a KeyError exception. Using the pattern modifiers described earlier, it is possible to explicitly control how strict or fuzzy the matching criterion should be. For instance, d[Pattern( ).exact()] requires pixel-perfect matching, whereas d[Pattern( ).anySize()] retrieves an array of values associated with different sizes of the same image.
Figure 7: Editor for writing Sikuli scripts in Python.
Script Editor
We developed an editor to help users write visual scripts (Figure 7). To take a screenshot of a GUI element to add to a script, a user can click on the camera button (a) in the toolbar to enter the screen capture mode. The editor hides itself automatically to reveal the desktop underneath, and the user can draw a rectangle around an element to capture its screenshot. The captured image can be embedded in any statement and is displayed as an inline image. The editor also provides code completion. When the user types a command, the editor automatically displays the corresponding command template to remind the user what arguments to supply. For example, when the user types find, the editor will expand the command into a find() template. The user can click on the camera button to capture a screenshot to be the argument for this find() statement. Alternatively, the user can load an existing image file from disk (b), or type the filename or URL of an image, and the editor automatically loads it and displays it as a thumbnail. The editor also allows the user to specify an arbitrary region of the screen to confine the search to that region (c). Finally, the user can press the execute button (d); the editor will be hidden and the script will be executed.
The editor can also preview how a pattern matches the current desktop (Figure 8) under different parameters, such as the similarity threshold (a) and the maximum number of matches (b), so that these can be tuned to include only the desired regions. Match scores are mapped to the hue and the alpha value of the highlight, so that regions with higher scores are redder and more visible.

Figure 8: The user can adjust the similarity threshold and preview the results. Here, the threshold (0.25) is too low, resulting in many false positives.
Implementation and Performance
We implemented the core pattern matching algorithm in C++ using OpenCV, an open-source computer vision library. The full API was implemented in Java, using the Java Robot class to execute keyboard and mouse actions. Based on the Java API, we built a Python library to offer the high-level scripting syntax described above. Libraries for other high-level scripting languages can also be built based on this API. We built the editor using Java Swing. To visualize the search patterns embedded in a script, we implemented a custom EditorKit to convert from a pure-text to a rich-text view. Finally, we used Jython to execute the Python scripts written by programmers. All components of the system are highly portable and have been tested on Windows and Mac OS X. We benchmarked the speed on a 3.2 GHz Windows PC. A typical call to find() for a 100x100 target on a 1600x1200 screen takes less than 200 msec, which is reasonable for many interactive applications. Further speed gains might be obtained by moving functionality to the GPU if needed in the future.

Sikuli Script Examples
We present six example scripts to demonstrate the basic features of Sikuli Script. For convenience in Python programming, we introduce two variables, find.region and find.regions, that respectively cache the top region and all the regions returned by the last call to find. While each script can be executed alone, it can also be integrated into a larger Python script that contains calls to other Python libraries and/or more complex logic statements.

1. Minimizing All Active Windows
1: while find( ):
2:     click(find.region)
This script minimizes all active windows by calling find repeatedly in a while loop (1) and calling click on each minimize button found (2), until no more can be found.
2. Deleting Documents of Multiple Types

3. Tracking Bus Movement
1: street_corner = find( )
2: while not street_corner.inside().find( ).similar(0.7):
3:     sleep(60)
4: popup("The bus is arriving")
This script tracks bus movement in the context of a GPS-based bus tracking application. Suppose a user wishes to be notified when a bus is just around the corner so that the user can head out and catch the bus. First, the script identifies the region corresponding to the street corner (1). Then, it enters a while loop and tries to find the bus marker inside the region every 60 seconds (2-3). Notice that about 30% of the marker is occupied by background that may change as the marker moves. Thus, the similar pattern modifier is used to look for a target 70% similar to the given pattern. Once such a target is found, a popup is shown to notify the user that the bus is arriving (4). This example demonstrates Sikuli Script's potential to help with everyday tasks.

4. Navigating a Map

5. Responding to Message Boxes
100: d[ ] =
101: import win32gui
102: while True:
103:     w = win32gui.getActiveWindow()
104:     img = getScreenshot(w)
105:     if img in d:
106:         button = d[img]
107:         click(Region(w).inside().find(button))
This script generates automatic responses to a predefined set of message boxes. A screenshot of each message box is stored in a visual dictionary d as a key, and the image of the button to automatically press is stored as a value. A large number of message boxes and desired responses are defined in this way (1-100). Suppose the win32gui library is imported (101) to provide the function getActiveWindow(), which is called periodically (102) to obtain the handle to the active window (103). Then, we take a screenshot by calling getScreenshot() (104) and check whether it is a key of d (105). If so, this window must be one of the message boxes specified earlier. To generate an automatic response, the relevant button image is extracted from d (106) and the region inside the active window matching the button image is found and clicked (107). This example shows that Sikuli Script can interact with any Python library to accomplish tasks neither can do alone. Also, using a VisualDict, it is possible to handle a large number of patterns efficiently.

6. Monitoring a Baby