Write A Computer Language Using Go (Golang)
Thorsten Ball
Chapter 1
Lexing
1.1 - Lexical Analysis
In order for us to work with source code we need to turn it into a more accessible form. As easy as plain text
is to work with in our editor, it becomes cumbersome pretty fast as soon as we try to interpret it as a
programming language from within another programming language.
So, what we need to do is represent our source code in other forms that are easier to work with. We're going
to change the representation of our source code two times before we interpret it: first from plain text into
tokens, and then from tokens into an abstract syntax tree built by the parser. The first transformation, from
source code to tokens, is called lexical analysis, or lexing for short, and it's done by a lexer.
Here's an example. This is the input we give to a lexer:

let x = 5 + 5;

And what comes out of the lexer looks kinda like this:
[
LET,
IDENTIFIER(x),
EQUAL_SIGN,
INTEGER(5),
PLUS_SIGN,
INTEGER(5),
SEMICOLON
]
All of these tokens have the original source code representation attached ("let" in the case of LET, "+" in the
case of PLUS_SIGN, and so on). Some, like IDENTIFIER and INTEGER in our example, also have the concrete
values they represent attached: 5 (not "5"!) in the case of INTEGER and "x" in the case of IDENTIFIER. But
that last part varies from lexer to lexer (e.g. some convert the "5" to an integer in the parser, or even later).
A thing to note about this example: whitespace has been ignored. In our case that's okay, because whitespace
is not significant in the Monkey language. It doesn't matter whether we type let x = 5; or spread it out over
several lines with as much whitespace as we like, such as let   x   =   5;.
In other languages, like Python, the whitespace is significant. That means the lexer can't just eat up the
whitespace and newline characters. It has to output them as tokens so the parser can later make sense
of them (or output an error, of course, if there is not enough or too much whitespace).
A production-ready lexer might also attach the line number, column number and filename to a token. Why?
For example, to later output more useful error messages in the parsing stage. Instead of "error: expected
semicolon token" it can output "error: expected semicolon token instead of plus sign. line
42, column 23, foobar.blub".
We're not going to bother with that. Not because it's too complex, but because it would take away from the
essential simplicity of the tokens and the lexer, making them harder to understand.
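Here is the small snippet of Monkey source code we're going to turn into tokens in this chapter. It's the
same snippet we'll feed to our lexer in the extended test later on:

let five = 5;
let ten = 10;

let add = fn(x, y) {
  x + y;
};

let result = add(five, ten);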
Let's break this down: which types of tokens does this example contain? First of all, there are the numbers
like 5 and 10. These are pretty obvious. Then we have the variable names x, y, add and result. There are
also parts of the language that are not numbers but just words, yet are not variable names either, like let and
fn. Of course, there are also a lot of special characters: (, ), {, }, =, ,, ;.
The numbers are just integers and we're going to treat them as such and give them a separate type. In the
lexer or parser we don't care whether the number is 5 or 10, we just want to know that it's a number. The same
goes for variable names: we'll call them identifiers and treat them all the same. Now, the other words, the ones
that look like identifiers but aren't really identifiers, since they're part of the language, are keywords. We won't
group these together with the identifiers, since it should make a difference in the parsing stage whether we
encounter a let or a fn. The same goes for the last category we identified: the special characters. We'll treat
them separately, since it makes a big difference whether we have a ( or a ) in the source code.
Let's define our Token data structure. Which fields does it need? As we just saw, we definitely need a type
attribute, so we can distinguish between, say, integers and right brackets. And it also needs a field
that holds the literal value of the token, so we can reuse it later and the information whether a number
token is a 5 or a 10 doesn't get lost.
In a new token package we define our Token struct and our TokenType type:
// token/token.go
package token
type TokenType string
type Token struct {
    Type    TokenType
    Literal string
}
We defined the TokenType type to be a string. That allows us to use many different values as TokenTypes,
which in turn allows us to distinguish between different types of tokens. Using string also has the advantage
of being easy to debug without a lot of boilerplate and helper functions: we can just print a string. Of
course, using a string might not lead to the same performance as using an int or a byte would, but for
this book a string is perfect.
As we just saw, there is a limited number of different token types in the Monkey language. That means we
can define the possible TokenTypes as constants. In the same file we add this:
// token/token.go
const (
    ILLEGAL = "ILLEGAL"
    EOF     = "EOF"

    // Identifiers + literals
    IDENT = "IDENT" // add, foobar, x, y, ...
    INT   = "INT"   // 1343456

    // Operators
    ASSIGN = "="
    PLUS   = "+"

    // Delimiters
    COMMA     = ","
    SEMICOLON = ";"

    LPAREN = "("
    RPAREN = ")"
    LBRACE = "{"
    RBRACE = "}"

    // Keywords
    FUNCTION = "FUNCTION"
    LET      = "LET"
)
As you can see, there are two special types: ILLEGAL and EOF. We didn't see them in the example above, but
we'll need them. ILLEGAL signifies a token/character we don't know about and EOF stands for "end of file",
which tells our parser later on that it can stop.
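Because TokenType is just a string, we already get the easy debugging we talked about above. A throwaway
program like this one (assuming the module is called monkey, as the import paths in the rest of this chapter
suggest) prints a token readably without any helper code:

package main

import (
    "fmt"

    "monkey/token"
)

func main() {
    tok := token.Token{Type: token.LET, Literal: "let"}
    fmt.Printf("%+v\n", tok) // prints: {Type:LET Literal:let}
}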
So far so good! We are ready to start writing our lexer. In a production setting it might make sense
to initialize the lexer with a *os.File or a set of them, so that filenames and line numbers could be attached
to tokens. But since that would add more complexity than we're here to handle, we'll start small and just use
a string and ignore filenames and line numbers.
Having thought this through, we now realize that what our lexer needs to do is pretty clear. So let's create
a new package and add a first test that we can run continuously to get feedback about the workings of
the lexer. We're starting small here and will extend the test case as we add more capabilities to the lexer:
// lexer/lexer_test.go
package lexer
import (
    "testing"

    "monkey/token"
)

func TestNextToken(t *testing.T) {
    input := `=+(){},;`

    tests := []struct {
        expectedType    token.TokenType
        expectedLiteral string
    }{
        {token.ASSIGN, "="},
        {token.PLUS, "+"},
        {token.LPAREN, "("},
        {token.RPAREN, ")"},
        {token.LBRACE, "{"},
        {token.RBRACE, "}"},
        {token.COMMA, ","},
        {token.SEMICOLON, ";"},
        {token.EOF, ""},
    }

    l := New(input)

    for i, tt := range tests {
        tok := l.NextToken()

        if tok.Type != tt.expectedType {
            t.Errorf("tests[%d] - tokentype wrong. expected=%q, got=%q",
                i, tt.expectedType, tok.Type)
            break
        }

        if tok.Literal != tt.expectedLiteral {
            t.Errorf("tests[%d] - literal wrong. expected=%q, got=%q",
                i, tt.expectedLiteral, tok.Literal)
            break
        }
    }
}
Of course the tests fail, since we haven't written any code yet:
% go test ./lexer
# monkey/lexer
lexer/lexer_test.go:31: undefined: New
lexer/lexer_test.go:34: l.NextToken undefined (type *Lexer has no field or method NextToken)
FAIL    monkey/lexer [build failed]
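Before this test can even compile we need the Lexer itself, in a new lexer package. Its definition is short,
and the field types follow directly from how we use them in the rest of this chapter:

// lexer/lexer.go
package lexer

type Lexer struct {
    input        string
    position     int  // current position in input (points to the rune in ch)
    readPosition int  // current reading position in input (after the rune in ch)
    ch           rune // current rune under examination
}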
Most of the fields in Lexer are pretty self-explanatory. The ones that might cause some confusion right now
are position and readPosition. The reason for two pointers pointing into our input string is the fact
that we will need to be able to peek further into the input, to look past the current rune and see what
comes up next. readPosition always points to the next character in the input. position points to the
character in the input that corresponds to the ch rune. A first helper method called readRune() should
make the usage of these fields easier to understand:
// lexer/lexer.go
import "unicode/utf8"
func (l *Lexer) readRune() {
    if l.readPosition >= len(l.input) {
        l.ch = -1
        l.position = l.readPosition
        return
    }

    character, width := utf8.DecodeRuneInString(l.input[l.readPosition:])

    l.ch = character
    l.position = l.readPosition
    l.readPosition += width
}
The first thing readRune does is check whether we have reached the end of our input and can't read any
more runes. If that's the case, it sets l.ch to -1, which signifies "end of file" for us. It also updates
l.position to the last read position, so that we later know how far we read into the input before reaching the
end. And then it returns.
But in case we haven't reached the end of the input yet, readRune reads the next rune in our input.
It does this by calling utf8.DecodeRuneInString. The reason for this call instead of just accessing
l.input[l.readPosition] is proper UTF-8 and Unicode support. Using l.input[l.readPosition] to get
the next character would not work when that character is more than one byte wide. By using DecodeRuneInString
we leverage the power of the rune data type, which can represent characters that are multiple bytes wide.
The l.ch field gets updated to the freshly read rune, l.position is updated to the just-used l.readPosition,
and l.readPosition is advanced by the width of the just-read rune. That way, l.readPosition always
points to the position where we're going to read from next and l.position always points to the position
where we last read. This will come in handy soon enough.
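To make this concrete, here is how the fields evolve for the two-rune input "a¢" (the cent sign is two bytes
wide in UTF-8), assuming the New() function we're about to write initializes the lexer and calls readRune()
once:

after New("a¢"):           ch = 'a'   position = 0   readPosition = 1
after the next readRune(): ch = '¢'   position = 1   readPosition = 3
after one more readRune(): ch = -1    position = 3   readPosition = 3 (end of input)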
Let's use readRune in our New() function so our lexer is in a fully working state, with l.ch, l.position
and l.readPosition already initialized, before anyone calls NextToken():
// lexer/lexer.go
func New(input string) *Lexer {
    l := &Lexer{
        input:        input,
        position:     0,
        readPosition: 0,
        ch:           -1,
    }

    l.readRune()

    return l
}
Our tests should now tell us that calling New(input) doesn't result in problems anymore, but the
NextToken() method is still missing. Let's fix that by adding a first version:
// lexer/lexer.go
import (
    "monkey/token"
    "unicode/utf8"
)

func (l *Lexer) NextToken() token.Token {
    var tok token.Token

    switch l.ch {
    case '=':
        tok = newToken(token.ASSIGN, l.ch)
    case ';':
        tok = newToken(token.SEMICOLON, l.ch)
    case '(':
        tok = newToken(token.LPAREN, l.ch)
    case ')':
        tok = newToken(token.RPAREN, l.ch)
    case ',':
        tok = newToken(token.COMMA, l.ch)
    case '+':
        tok = newToken(token.PLUS, l.ch)
    case '{':
        tok = newToken(token.LBRACE, l.ch)
    case '}':
        tok = newToken(token.RBRACE, l.ch)
    case -1:
        tok.Literal = ""
        tok.Type = token.EOF
    }

    l.readRune()
    return tok
}
func newToken(tokenType token.TokenType, r rune) token.Token {
    return token.Token{Type: tokenType, Literal: string(r)}
}
That's the basic structure of the NextToken() method. We look at the current rune under examination
(l.ch) and return a token depending on which rune it is. Before returning the token we advance our pointers
into the input, so that when we call NextToken() again the l.ch field is already updated. A small function
called newToken helps us with initializing these tokens.
Running the tests we can see that they pass:
% go test ./lexer
ok      monkey/lexer 0.007s
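If you want to poke at the lexer outside of the test suite, a small throwaway program (again assuming the
module is called monkey) prints every token it produces until it reaches EOF:

package main

import (
    "fmt"

    "monkey/lexer"
    "monkey/token"
)

func main() {
    l := lexer.New("=+(){},;")
    for tok := l.NextToken(); tok.Type != token.EOF; tok = l.NextToken() {
        fmt.Printf("%+v\n", tok)
    }
}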
Great! Let's now extend the test case so it starts to resemble Monkey source code.
// lexer/lexer_test.go
func TestNextToken(t *testing.T) {
    input := `let five = 5;
let ten = 10;
let add = fn(x, y) {
  x + y;
};
let result = add(five, ten);
`

    tests := []struct {
        expectedType    token.TokenType
        expectedLiteral string
    }{
        {token.LET, "let"},
        {token.IDENT, "five"},
        {token.ASSIGN, "="},
        {token.INT, "5"},
        {token.SEMICOLON, ";"},
        {token.LET, "let"},
        {token.IDENT, "ten"},
        {token.ASSIGN, "="},
        {token.INT, "10"},
        {token.SEMICOLON, ";"},
        {token.LET, "let"},
        {token.IDENT, "add"},
        {token.ASSIGN, "="},
        {token.FUNCTION, "fn"},
        {token.LPAREN, "("},
        {token.IDENT, "x"},
        {token.COMMA, ","},
        {token.IDENT, "y"},
        {token.RPAREN, ")"},
        {token.LBRACE, "{"},
        {token.IDENT, "x"},
        {token.PLUS, "+"},
        {token.IDENT, "y"},
        {token.SEMICOLON, ";"},
        {token.RBRACE, "}"},
        {token.SEMICOLON, ";"},
        {token.LET, "let"},
        {token.IDENT, "result"},
        {token.ASSIGN, "="},
        {token.IDENT, "add"},
        {token.LPAREN, "("},
        {token.IDENT, "five"},
        {token.COMMA, ","},
        {token.IDENT, "ten"},
        {token.RPAREN, ")"},
        {token.SEMICOLON, ";"},
        {token.EOF, ""},
    }

    l := New(input)

    for i, tt := range tests {
        tok := l.NextToken()

        if tok.Type != tt.expectedType {
            t.Errorf("tests[%d] - tokentype wrong. expected=%q, got=%q",
                i, tt.expectedType, tok.Type)
            break
        }

        if tok.Literal != tt.expectedLiteral {
            t.Errorf("tests[%d] - literal wrong. expected=%q, got=%q",
                i, tt.expectedLiteral, tok.Literal)
            break
        }
    }
}
The updated input we initialize the lexer with is a subset of the Monkey language. It contains all the symbols
we already successfully turned into tokens. But what's been added now causes our tests to fail: identifiers,
keywords and numbers.
Let's start with the identifiers and keywords. What our lexer needs to do is recognize whether the current
rune under examination is a letter and, if so, read the rest of the identifier/keyword until it
encounters a non-letter character. Having read that identifier/keyword, we then need to find out whether it is an
identifier or a keyword, so we can use the correct token.TokenType. The first step is extending our switch
statement:
// lexer/lexer.go
import (
    "monkey/token"
    "unicode"
    "unicode/utf8"
)

func (l *Lexer) NextToken() token.Token {
    var tok token.Token

    switch l.ch {
    // [...]
    default:
        if isLetter(l.ch) {
            tok.Literal = l.readIdentifier()
            return tok
        } else {
            tok = newToken(token.ILLEGAL, l.ch)
        }
    }

    // [...]
}

func (l *Lexer) readIdentifier() string {
    position := l.position
    for isLetter(l.ch) {
        l.readRune()
    }
    return l.input[position:l.position]
}

func isLetter(ch rune) bool {
    return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' ||
        ch == '_' ||
        ch >= utf8.RuneSelf && unicode.IsLetter(ch)
}
We added a default arm to our switch statement, so we can check for identifiers whenever l.ch is not
one of the recognized runes. We also added the generation of token.ILLEGAL tokens: if we end up there,
we truly don't know how to handle the current rune and declare it token.ILLEGAL.
The isLetter helper function just checks whether the given argument is a letter. It does this by checking
whether it's a normal ASCII letter or a Unicode letter. That's just a little dance we have to do to support
Unicode.
What's noteworthy about isLetter is that changing this function has a larger impact on the language our
interpreter will be able to parse than one would expect from such a small function. As you can see, in our
case it contains the check ch == '_', which means we'll treat _ as a letter and allow it in identifiers
and keywords. That means we can use variable names like foo_bar. Other programming languages even
allow ! and ? in identifiers. If you want to allow that too, this is the place to sneak it in, probably in a
separate function called isValidIdentifierCharacter that can be used in combination with isLetter.
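Such a helper is not part of our lexer, but a sketch of it could look like this, with the name and the exact
set of extra characters purely illustrative:

// Hypothetical helper, not used in Monkey's lexer: it would allow '!' and '?'
// in identifiers, in addition to everything isLetter already accepts.
func isValidIdentifierCharacter(ch rune) bool {
    return isLetter(ch) || ch == '!' || ch == '?'
}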
readIdentifier() does exactly what its name suggests: it reads in an identifier and advances our lexer's
positions until it encounters a non-letter character.
In the default: arm of the switch statement we use readIdentifier() to set the Literal field of our
current token. But what about its Type? Now that we have read in identifiers like let, fn or foobar, we
need to be able to tell user-defined identifiers apart from language keywords. We need a function that returns
the correct TokenType for the token literal we have. What better place than the token package to add such
a function?
// token/token.go
var keywords = map[string]TokenType{
    "fn":  FUNCTION,
    "let": LET,
}

func LookupIdent(ident string) TokenType {
    if tok, ok := keywords[ident]; ok {
        return tok
    }
    return IDENT
}
LookupIdent checks the keywords table to see whether the given identifier is in fact a keyword. If it is, it
returns the keyword's TokenType constant. If it isn't, we just get back token.IDENT, which is the TokenType
for all user-defined identifiers.
With this in hand we can now complete the lexing of identifiers and keywords:
// lexer/lexer.go
func (l *Lexer) NextToken() token.Token {
    var tok token.Token

    switch l.ch {
    // [...]
    default:
        if isLetter(l.ch) {
            tok.Literal = l.readIdentifier()
            tok.Type = token.LookupIdent(tok.Literal)
            return tok
        } else {
            tok = newToken(token.ILLEGAL, l.ch)
        }
    }

    // [...]
}
The early exit here, our return tok statement, is needed because when calling readIdentifier() we call
readRune() repeatedly, advancing our readPosition and position fields past the last rune of the current
identifier. So we don't want to call readRune() again after the switch statement before returning.
Running our tests now, we can see that let is identified correctly but the tests still fail:
% go test ./lexer
--- FAIL: TestNextToken (0.00s)
lexer_test.go:70: tests[1] - tokentype wrong. expected="IDENT", got="ILLEGAL"
FAIL
FAIL    monkey/lexer 0.008s
The problem is the next token we want: an IDENT token with "five" in its Literal field. Instead we get an
ILLEGAL token. Why is that? Because of the whitespace character between let and five. But in Monkey
whitespace is not significant. It shouldn't make a difference whether we write let five or let    five. So
we need to skip over whitespace entirely.
// lexer/lexer.go
func (l *Lexer) NextToken() token.Token {
    var tok token.Token

    l.skipWhitespace()

    switch l.ch {
    // [...]
}

func (l *Lexer) skipWhitespace() {
    for l.ch == ' ' || l.ch == '\t' || l.ch == '\n' || l.ch == '\r' {
        l.readRune()
    }
}
This little helper function is found in a lot of parsers. Sometimes it's called eatWhitespace, sometimes
consumeWhitespace and sometimes something entirely different. Which characters these functions actually
skip depends on the language. Some languages do create tokens for newline characters, for example, and throw
parsing errors if they are not at the correct place in the stream of tokens. We skip over newline characters
to make the parsing step later on a little easier. And in case you haven't noticed, there's still a bug in here:
one of those invisible Unicode characters, which is effectively whitespace in this context, would cause our lexer
to trip and output an ILLEGAL token. Again: for simplicity's sake we'll ignore these for now. Whitespace in
the Monkey language consists of the space, tab, newline and carriage return characters.
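If we ever wanted to treat those invisible Unicode whitespace characters as insignificant too, a variant of
the helper could lean on the unicode package, which lexer.go already imports for isLetter. This is just a
sketch, not what we use in this book:

// Sketch only: skips every rune that Unicode classifies as whitespace.
func (l *Lexer) skipWhitespace() {
    for unicode.IsSpace(l.ch) {
        l.readRune()
    }
}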
With skipWhitespace() in place, the lexer trips over the 5 in the let five = 5; part of our test input.
And that's right, it doesn't know yet how to turn numbers into tokens. It's time to add this.
As we did previously for identifiers, we now need to add more functionality to the default arm of our switch
statement.
// lexer/lexer.go
func (l *Lexer) NextToken() token.Token {
    var tok token.Token

    l.skipWhitespace()

    switch l.ch {
    // [...]
    default:
        if isLetter(l.ch) {
            tok.Literal = l.readIdentifier()
            tok.Type = token.LookupIdent(tok.Literal)
            return tok
        } else if isDigit(l.ch) {
            tok.Type = token.INT
            tok.Literal = l.readNumber()
            return tok
        } else {
            tok = newToken(token.ILLEGAL, l.ch)
        }
    }

    // [...]
}

func (l *Lexer) readNumber() string {
    position := l.position
    for isDigit(l.ch) {
        l.readRune()
    }
    return l.input[position:l.position]
}

func isDigit(ch rune) bool {
    return '0' <= ch && ch <= '9'
}
As you can see, the added code closely mirrors the part concerned with reading identifiers and keywords.
The readNumber method is exactly the same as readIdentifier except for its usage of isDigit instead of
isLetter. We could probably generalize this by passing in the rune-identifying functions as arguments, but
won't, for simplicity's sake and ease of understanding.
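Just to illustrate what that generalization could look like (this is not code we'll use, and the name
readWhile is made up for this sketch), both methods could be expressed through one helper that takes the
rune-identifying function as an argument:

// Sketch only: a generalized reader. readIdentifier would become
// l.readWhile(isLetter) and readNumber would become l.readWhile(isDigit).
func (l *Lexer) readWhile(pred func(rune) bool) string {
    position := l.position
    for pred(l.ch) {
        l.readRune()
    }
    return l.input[position:l.position]
}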
The isDigit function is as simple as isLetter. It just returns whether the passed in rune is a Latin digit
between 0 and 9.
With this added, our tests pass:
% go test ./lexer
ok      monkey/lexer 0.008s
I don't know if you noticed, but we simplified things a lot in readNumber. We only read in integers. What
about floats? Or numbers in hex notation? Octal notation? We ignore them for now and just say that
Monkey doesn't support them. Of course, the reason for this is again the educational aim and limited scope
of this book.
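Still, just to show where such an extension would live (none of this is part of Monkey, and a new token type
such as FLOAT would also be needed), readNumber could be taught to accept a single decimal point like this:

// Sketch only: a readNumber variant that would also read simple float
// literals such as 3.14 by allowing one '.' followed by more digits.
func (l *Lexer) readNumber() string {
    position := l.position
    for isDigit(l.ch) {
        l.readRune()
    }
    if l.ch == '.' {
        l.readRune()
        for isDigit(l.ch) {
            l.readRune()
        }
    }
    return l.input[position:l.position]
}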
It's time to pop the champagne and celebrate: we successfully turned the small subset of the Monkey language
we used in our test case into tokens!
With this victory under our belt, it's easy to extend the lexer so it can tokenize a lot more of Monkey source
code.
End Of Sample
You've reached the end of the sample. I hope you enjoyed it. You'll soon be able to buy the full version of
the book (in multiple formats, including the code) online at:
http://interpreterbook.com