SourceQL Paper 3: Progress Report 1 For EECS 395 (Senior Project)

Builds on paper 2 with details about disk-level index structures and a regular expression engine for tries.


SourceQL Progress Report 1

Tim Henderson, Steve Johnson, and Evan Krall


Case Western Reserve University
March 2, 2010

1 Summary
We are writing a system to perform an efficient regular expression search over a version con-
trol repository. This system is part of a larger effort to build a tool to mine data from version
control systems with the goal of providing useful metrics about the code, the programmers,
and the team organization. Accomplishing these tasks efficiently requires special indexes
and tight integration between components.
The overarching project goal is large, and the subset of functionality chosen for this course is itself substantial. Since the proposal was written, the schedule has been adjusted to reflect new discoveries about what needs to be accomplished, but the team is still on track to complete the stated task within the allotted time. Specifically, the system is still expected to perform a full regular expression search over a source control repository in significantly better time than the naive approach.

2 Overall Design
SourceQL has many components which, while developed in isolation, must work together to
construct the results of the user’s queries. To ease the division of labor and pursue a clean
design, we have designed each component to be as independent as possible.

2.1 Components
2.1.1 SourceGraph
The commit graph of an SCM is a directed acyclic graph. It can be very large, and most
SCMs do not optimize for access time on commit edges. To help queries run faster, we will
index graph metadata including edges between nodes, committer name, and commit time.
SourceGraph is the name of the component which maintains the graph index and provides
access to it for other components.

The algorithms for maintaining the index, specifying subgraphs from commit metadata, and performing algebraic operations are all defined in the final report for EECS 433. For this
project, the algorithms and indexes will be prototyped naively in Python, since the size of
the graph usually does not exceed 20,000 or so nodes and can be loaded into memory quickly.
Commit metadata will eventually be stored in a disk-based B+ tree.
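As an illustration, an in-memory prototype of this kind might look like the following sketch; the class and method names here are hypothetical, not SourceGraph's actual API:

```python
# Hypothetical sketch of an in-memory commit-graph index; the real
# SourceGraph component and its interface may differ.
class CommitGraph:
    def __init__(self):
        self.parents = {}    # rev_id -> list of parent rev_ids
        self.children = {}   # rev_id -> list of child rev_ids
        self.meta = {}       # rev_id -> {"committer": ..., "time": ...}

    def add_commit(self, rev_id, parent_ids, committer, time):
        self.parents[rev_id] = list(parent_ids)
        self.children.setdefault(rev_id, [])
        for p in parent_ids:
            self.children.setdefault(p, []).append(rev_id)
        self.meta[rev_id] = {"committer": committer, "time": time}

    def descendants(self, rev_id):
        """All commits reachable from rev_id by following child edges."""
        seen, stack = set(), [rev_id]
        while stack:
            node = stack.pop()
            for child in self.children.get(node, []):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen
```

A graph of roughly 20,000 nodes stored this way fits comfortably in memory, which is why the naive prototype is acceptable for now.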

2.1.2 ReTrie
The baseline query target is a full regular expression search across every commit and file
in the SCM. Running regular expressions on each item in each diff takes an unacceptable
amount of time, so SourceQL includes a full-text exact match index. An in-memory trie
structure is being used to index every line under version control, with plans to move to a
disk-based trie. Matching is performed by a virtual machine that executes an NFA generated from a regular expression by a compiler.
The in-memory trie structures and compiler have been implemented. They still need to
be connected.
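As a rough illustration of the in-memory structure, a minimal character-per-edge trie over lines might look like this sketch (names hypothetical; the real ReTrie structures differ):

```python
# Simplified sketch of an in-memory trie indexing lines of text.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.line_ids = []   # ids of indexed lines ending at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, line, line_id):
        node = self.root
        for ch in line:
            node = node.children.setdefault(ch, TrieNode())
        node.line_ids.append(line_id)

    def exact_match(self, line):
        node = self.root
        for ch in line:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.line_ids
```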

2.1.3 SCM integration


The subgraph selection and ReTrie components both require access to the data in the SCM.
Hooks have been implemented to pass data from the Mercurial SCM to the various compo-
nents of SourceQL. These hooks are currently using dummy structures which are stubs for
interfaces to the real data structures (for example, a hash table in place of a trie). Much
of the work for the next month is substituting the real data structures for the dummy data
structures.

2.1.4 Results projection


With the above three components finished or nearly finished, SourceQL will soon support
hard-coded queries. When this happens, a spec for a JSON-based output format will be
written and implemented.
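The format has not been specified yet; purely as an illustrative guess based on the tuple shapes in figure 1, a result might serialize along these lines:

```python
import json

# Hypothetical result shape only; the real JSON spec is still unwritten.
# Field names follow the tuple (rev_id, file, line #, text) from figure 1.
result = {
    "query": "regex R on path(X, Y)",   # made-up query description
    "matches": [
        {"rev_id": "4f2a", "file": "src/main.py", "line": 17,
         "text": "import re"},
    ],
}
print(json.dumps(result, indent=2))
```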

2.1.5 Query language and query processor


The ultimate project goal is to provide users with a domain-specific query language for
version control systems, so a compiler and query processor will need to be written. It is not
a high priority for this semester, but the baseline is to be able to perform a regular expression
search over a user-specified subgraph, so some amount of parsing and query processing will
be necessary.
The query language will require a language spec and compiler. The query processor will
define exact relationships between the components, stream instructions and data between
components, and optimize queries so they perform as efficiently as possible against the file
structures.

[Diagram: a path + regex query enters the query parser, which sends a trie query to ReTrie (regex compiler → regex VM over the trie index) and a commit path query to SourceGraph (graph compiler → graph query VM). ReTrie returns matches as (rev_id, file, line #, text) tuples and SourceGraph returns matching (rev_id) values; the two sets are joined, producing the final matches of the form (rev_id, file, line #, text).]

Figure 1: Example flow of a query: match a regular expression on a subgraph specified by the syntax defined in our EECS 433 paper. This diagram also shows the internal structure of ReTrie and SourceGraph.

2.2 Example Query


The flow of data through SourceQL for an example query is shown in figure 1. This query is
of the form, “Find all nodes on a path between commits X and Y which match the regular
expression R.” ReTrie and SourceGraph independently generate subgraph forests for their
criteria, and those results are joined. The results of the join are returned to the user.
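The join itself is conceptually simple. A minimal sketch, assuming ReTrie matches are (rev_id, file, line #, text) tuples and SourceGraph yields a set of rev_ids (the function name is hypothetical):

```python
# Sketch of the join step in figure 1: keep only ReTrie matches whose
# rev_id also appears in SourceGraph's subgraph results.
def join(retrie_matches, graph_rev_ids):
    """retrie_matches: iterable of (rev_id, file, line_no, text) tuples;
    graph_rev_ids: rev_ids selected by the commit path query."""
    allowed = set(graph_rev_ids)
    return [m for m in retrie_matches if m[0] in allowed]
```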

3 New Tools and Discoveries


To help the team keep track of bugs, feature requests, and current tasks, a FogBugz student
account was acquired. FogBugz is a tool used in the software development industry for time
tracking, support case handling, deadline estimation, and task management.
Since the proposal was written, the team discovered that Mercurial already has a feature

[Gantt chart, February through April. Tim: file structures, then algorithms, then piping instructions and data between components; Evan: B-trie and in-memory tries, then the query language, then defining the result format; Steve: SCM integration (Hg), then getting queries running, then the query language compiler; time at the end is reserved for fixing bugs and preparing results.]
Figure 2: Original estimated schedule

which allows a user to do a full regular expression search across a repository’s history.
However, the process is unacceptably slow for a large repository. This tool will be used
to benchmark SourceQL and demonstrate its effectiveness.
In addition to software tools, the EECS 405: File Structures course has been very helpful to Tim and Steve in implementing efficient index structures.

4 Schedule Progress
Of our initial projections for the progress of the three team members, only one was met.
Steve finished the Mercurial hooks, providing a way to get data from a repository into an
index structure. He wrote some simple, inefficient data structures to temporarily replace the
indexes which will be used in the final product. These include a hash table to replace the
trie, a serialized in-memory graph to replace the graph index, and an append-only log file
to store text data from a commit. When the index structures are ready, it will be relatively
simple to replace the temporary code with interfaces to the efficient index structures.
Evan has been making progress implementing and testing in-memory B-tries, Patricia
tries, and suffix trees. These are all nearing completion and will soon be ready for testing.
Tim has been working on the low-level disk-based index structures. This has proven to
be a larger amount of work than originally expected. This functionality does not exist as a
library for the Go language, so much of the work is breaking new ground with the language,
which is itself still in beta. A full description of his work with blocks and B-trees is in section
5.1.
To help get the schedule moving forward, Steve will be working with Tim on blocks and
related work. Some padding was intentionally built into the month of March in the original
schedule, so minor slippage is acceptable.

[Diagram (a), a block: a keys segment (e.g. 4 5 5 5 8 9 9), a records segment of []byte data addressed by (uint32 position, uint32 length) pairs, a pointers segment, and extra space. Diagram (b), a buffer: block operations bypass the OS buffer and go through our own buffer, currently LRU or LFU.]
Figure 3: Block data structures

5 Design and Implementation of New Components


5.1 Blocks
5.1.1 Motivation
The SourceQL design contains multiple indexes representing different relations and types
of data. The long-term plan includes a disk-based trie, a few B+ trees of relational data,
and a sequential file to store commit metadata. Each of these will require a different index
structure.

5.1.2 Design
All index structures will be based on the standard file structure concepts of blocks and files.
To support as many high-level structures as possible, the implementation of blocks is as
general as possible. The current design is shown in figure 3(a).
Blocks are organized into three segments: keys, records, and pointers. Keys are used to reference records and to calculate pointer positions. Not all index structures require a pointer for each key, so pointers are referenced separately from records, and the programmer can choose to keep either one pointer per key plus one extra, or just a single pointer to another block. These options allow our blocks to support B+ trees and sequential files easily without wasting space.
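Although the real implementation is written in Go, the layout can be sketched in Python; the field order and sizes below are assumptions for illustration, not the actual on-disk format:

```python
import struct

# Illustrative block serialization: a header, a keys segment, a
# (position, length) pair per record, the record data, and a pointers
# segment. All integers little-endian uint32; this exact layout is an
# assumption, not SourceQL's real format.
def pack_block(keys, records, pointers):
    """keys: list of uint32; records: list of bytes; pointers: list of uint32."""
    offsets, pos = [], 0
    for r in records:
        offsets.append((pos, len(r)))
        pos += len(r)
    out = struct.pack("<III", len(keys), len(records), len(pointers))
    out += struct.pack("<%dI" % len(keys), *keys)
    for position, length in offsets:
        out += struct.pack("<II", position, length)
    out += b"".join(records)
    out += struct.pack("<%dI" % len(pointers), *pointers)
    return out
```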
For disk-based storage, blocks are serialized to byte strings and stored in files. However,
since many disk-based index structures do not perform optimally with the operating system’s
I/O buffer, the SourceQL design includes an extra layer over the file which implements a
custom buffer system. Figure 3(b) shows the design.

Figure 4: Output of a B-tree visualization generated by unit tests

As shown in the diagram, block operations bypass the OS buffer and go to SourceQL’s
write-through buffer. For reading, the implementation can currently use LRU or LFU.
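A minimal sketch of a write-through LRU buffer of this kind follows; the real buffer, which can also run LFU, is more involved, and the dict-like backing store here stands in for the block file:

```python
from collections import OrderedDict

# Write-through LRU block buffer sketch: writes always hit the backing
# store, reads are served from cache when possible.
class LRUBuffer:
    def __init__(self, backing, capacity):
        self.backing = backing          # dict-like: block_id -> bytes
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)    # mark most recently used
            return self.cache[block_id]
        block = self.backing[block_id]
        self._put(block_id, block)
        return block

    def write(self, block_id, block):
        self.backing[block_id] = block          # write-through: store first
        self._put(block_id, block)

    def _put(self, block_id, block):
        self.cache[block_id] = block
        self.cache.move_to_end(block_id)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
```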

5.1.3 B-Tree
There is a working, unit tested implementation of B-trees which supports insertions and
reads. The unit tests produce visualizations of the trees they use in order to make it easier
to check for mistakes. One such visualization is shown in figure 4.

5.2 Mercurial Hooks and Support


5.2.1 Installation
The user installs all SourceQL executables to an accessible directory and appends the installation path to the PYTHONPATH environment variable. Then, for each repository that is to be indexed, the following lines are added to the repository's .hg/hgrc file:

[hooks]
commit.sourceql_commit = python:sourceql.commit
changegroup.sourceql_changegroup = python:sourceql.changegroup
update.sourceql_update = python:sourceql.update
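For illustration, a hook function of this kind might be shaped roughly as follows; Mercurial calls Python hooks with ui, repo, and keyword arguments such as node, but the body shown is a hypothetical stand-in for the real sourceql module:

```python
# Hypothetical sketch of a SourceQL hook; the actual sourceql module's
# internals are not shown here. For commit and changegroup hooks,
# Mercurial's keyword arguments include "node", the first new changeset.
def commit(ui, repo, hooktype=None, node=None, **kwargs):
    ui.status("sourceql: indexing from changeset %s\n" % node)
    # ...walk the new changesets and feed their diffs to the index...
    return False  # a falsy return tells Mercurial the hook succeeded
```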

5.2.2 Update Process


The configuration in section 5.2.1 instructs Mercurial to call certain SourceQL functions when the
user commits changes or brings in commits from another repository. Mercurial passes the
earliest common ancestor of all new changes to these functions. When the functions are
called, the following things happen:

• The commit graph is loaded in its entirety from the disk. This will eventually be
changed to use an efficient disk-based graph index.

• The commit graph is traversed starting at the node passed in by Mercurial.

• Each traversed node is added to SourceQL’s commit graph representation.

• For each new node, the diff between it and its parent is parsed. Each line is handled
according to the criteria specified in section 5.2.3.

• The commit graph representation is saved naively to the disk.
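Condensed into code, the steps above might look like this sketch, where the graph, diff, and indexing helpers are hypothetical stand-ins for SourceQL's real functions:

```python
# Sketch of the update process: walk new commits starting from the
# common ancestor Mercurial hands us, indexing each diff line. The
# callables passed in are hypothetical stand-ins.
def handle_update(start_node, get_children, get_diff, index_line, graph):
    stack, seen = [start_node], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        graph.add(node)                       # extend the commit graph
        for parent, line, added in get_diff(node):
            index_line(node, parent, line, added)
        stack.extend(get_children(node))
    return seen
```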

5.2.3 Diffs
This section deals only with how diffs are handled. For a full explanation of what a diff is and how it represents changes to text files, see http://en.wikipedia.org/wiki/Diff.
A diff is a listing of lines added and deleted between one text file and another. In terms of a repository, a diff represents an edge between two commits, since a commit captures a state and a diff is the transition between two states. This means that a merge will have two sets of diffs, one for each parent.
Accurately representing what a diff does requires three pieces of data: the parent, the
child, and whether the line was added or removed. Lists of these tuples will need to be
accessed by the regex VM when it reaches a leaf of the trie, so it is important that the
indexing methods are efficient. The current design specifies that each string be given a
unique ID, generated when the string is first inserted into the trie, which is a key for one
or more entries in a B+ tree. These entries will be 3-tuples of the form (parent, child,
ADDED OR REMOVED).
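A minimal sketch of this schema, with a plain dict standing in for both the trie (which assigns the IDs) and the B+ tree (which holds the entries):

```python
ADDED, REMOVED = "ADDED", "REMOVED"

# Schema sketch: each unique line gets an ID on first insertion, and
# that ID keys a list of (parent, child, ADDED-or-REMOVED) tuples.
class LineIndex:
    def __init__(self):
        self.ids = {}       # line text -> unique id (trie stand-in)
        self.entries = {}   # id -> [(parent, child, op)] (B+ tree stand-in)

    def record(self, line, parent, child, op):
        line_id = self.ids.setdefault(line, len(self.ids))
        self.entries.setdefault(line_id, []).append((parent, child, op))
        return line_id
```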
This storage schema only stores where lines are added and removed, not every single
commit in which a line exists. Returning all commits which contain the given line is simply
an outer join of the results of a series of graph traversals. The algorithm is still under
development to handle all corner cases, but it should run in linear time.

5.3 Trie
In order to allow quick access to records based on the string contents of a line, we need
an appropriate index. We chose trie-based indexes, as they lend themselves to the task
of regular expression searching. Hash tables would allow amortized O(1) access time, but they require knowing the full text of a line at search time, which makes them inappropriate for our purposes. B-trees and B+ trees are better, as they allow quick access to a record given a prefix of the line. However, a regular expression may have many valid prefixes, in which case B-tree performance would degrade: one would either have to search for each prefix separately, or search for one prefix and scan along the leaves, checking each candidate line against the regular expression.
On the other hand, trie-based structures provide a very easy and straightforward way to
search with a regular expression. A trie is a tree where each node represents a set of strings
having a common prefix. The paths between nodes represent the parts of that prefix. In a ba-
sic trie, each edge is one character. This can cause a waste of space; every character in a string
requires a node. This overhead means a simple trie can take many times more space than
the original data, which is unacceptable for us. A simple improvement is to label edges with
strings, rather than characters. This modification forms what is known as a Patricia trie.

In a Patricia trie, nodes only exist where strings diverge, rather than for every character. This means that edges representing long strings take little more space than the strings themselves.
Our application requires storing very large sets of strings, which may be larger than the entire main memory of the host computer. This calls for a disk-based structure, but simply storing a Patricia or plain trie on disk is expensive, since there would be one disk access for each node, which could be many thousands when searching with a regular expression. Fortunately, there is a trie-based structure called the B-trie, developed by Nikolas Askitis and Justin Zobel, which provides good performance on disk and retains the advantages of a plain trie when searched with a regular expression. A B-trie is based around a set of buckets, each containing a set of strings that have a common prefix. The buckets are indexed using a trie structure that is built every time a bucket runs out of space and needs to be split.

Figure 5: A Patricia trie storing the strings “program”, “programmers”, “programming”, “algorithm”, and “application.”

5.4 Trie Traversal VM

In order to exploit the redundancy-eliminating nature of the trie, SourceQL must use a redundancy-eliminating algorithm to match. The algorithm developed is based on the one designed by Ken Thompson and explained by Russ Cox (http://swtch.com/~rsc/regexp/regexp2.html), with differences to account for node splits in the trie.
SourceQL’s algorithm is still under development due to the incomplete state of the implementation, but in essence, the regex search runs as a VM which splits itself at every choice point in the regex automaton and at every branch in the trie. This approach allows the algorithm to exploit massive parallelism and caching. It has been shown, both theoretically and in practice, to run significantly faster than matching a list of strings with the Python standard library’s regex functions.
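The shared-prefix idea can be sketched as follows: walk the trie depth-first while carrying the set of live automaton states, so that each common prefix is fed to the machine only once. The automaton is abstracted into step and accepting functions here; the real VM is considerably more elaborate:

```python
# Sketch only: a character-per-edge trie node, plus a search that shares
# prefix work across all indexed lines. "step" advances a set of NFA
# states on one character; "accepting" tests whether any state accepts.
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.line_ids = []   # ids of lines ending at this node

def search(node, step, accepting, states, prefix="", results=None):
    if results is None:
        results = []
    if node.line_ids and accepting(states):
        results.append((prefix, list(node.line_ids)))
    for ch, child in node.children.items():
        nxt = step(states, ch)
        if nxt:  # an empty state set prunes the whole subtrie
            search(child, step, accepting, nxt, prefix + ch, results)
    return results
```

Pruning on an empty state set is what makes the trie pay off: one dead prefix eliminates every line sharing it in a single step.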

5.5 Conclusion
SourceQL is unique as a senior project in that it was almost completely designed and archi-
tected even before the semester began. The main challenge is to maintain a good working
speed and sound development practices to meet deadlines and avoid bugs. As soon as the
VM is working, the project will have the ability to run queries.
