Haskell Tokenizer
Before we can start implementing the tokenizer, we have to define the Token data
type and learn more about Strings.
Let's start with a familiar example, the definition of Bool:
data Bool = False | True
A data type definition is introduced by the keyword data. Bool is the name
of the type we are defining. The right-hand side of the equal sign lists
the constructors, separated by vertical bars. When you create a new Bool value,
you use one of these two constructors. Constructor names must start with a
capital letter and must be unique within a module (two data types in the same
module can't share a constructor name).
When you want to inspect a Bool value, you match it with one of the
constructors (remember, a value remembers how it was constructed). There are
several ways of matching values to constructors in Haskell. Let's start with the
simplest one: Defining a function using multiple equations. Instead of defining
a function with one equation, like this:
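boolToInt :: Bool -> Int
boolToInt b = if b then 1 else 0
you can split it into two equations, one per constructor:
boolToInt :: Bool -> Int
boolToInt True = 1
boolToInt False = 0
main = print $ boolToInt False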
Patterns are matched in order, so when boolToInt is called with False, the
runtime first tries to match it to True and fails, so it moves to the second
pattern False and succeeds. (All equations for the same function must be
consecutive.)
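To see why the order of equations matters, here is a minimal sketch. The enumeration Light and the function canGo are made-up names used only for illustration; they are not part of the tokenizer. The second equation uses the wildcard pattern _, so it fires only when the first one has failed to match:

```haskell
-- Light and canGo are illustrative names, not part of the tokenizer project.
data Light = Red | Yellow | Green

canGo :: Light -> Bool
canGo Green = True   -- tried first
canGo _     = False  -- reached only when the value is not Green

main :: IO ()
main = do
    print (canGo Green)   -- True
    print (canGo Yellow)  -- False
```

If the two equations were swapped, the wildcard would match every value and canGo would always return False; GHC typically warns that the second equation is then redundant.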
(Note: To save on parentheses I will start using the function application
operator $ that I introduced in the first tutorial. It's been a long time, so
here's a quick recap: $ separates a function from its argument. It's very
useful when the argument is another function call, because function application
binds to the left. In our example, without the $ or parentheses, the function
calls would bind as (print boolToInt) False and would fail to compile. Operator
$ has very low precedence, so everything to its right is grouped into a single
argument for the function on its left, and it associates to the right.)
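As a quick illustration of that recap, the two lines in main below are meant to be equivalent. This is only a sketch; it uses the Prelude functions negate and abs, which are not otherwise part of this tutorial:

```haskell
main :: IO ()
main = do
    print (negate (abs (-3)))   -- fully parenthesized
    print $ negate $ abs (-3)   -- same expression written with $
```

Because $ associates to the right, the second line groups as print $ (negate $ (abs (-3))), which is exactly the nesting spelled out with parentheses in the first line.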
Here's a useful enumeration that we will use in our project:
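data Operator = Plus | Minus | Times | Div
A function that converts an Operator to the corresponding character (call it
opToChar) needs one equation per constructor:
opToChar :: Operator -> Char
opToChar Plus = '+'
opToChar Minus = '-'
opToChar Times = '*'
opToChar Div = '/'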
Token
Our tokenizer should recognize operators, identifiers, and numbers. We can
enumerate the four operators, but we can't enumerate all possible identifiers
or numbers. For those tokens we need to store additional information:
a String and an Int respectively. Here's the definition of Token:
data Token = TokOp Operator
           | TokIdent String
           | TokNum Int
    deriving (Show, Eq)
All three constructors now take arguments. The TokOp constructor takes a value
of the type Operator, TokIdent takes a String, and TokNum takes an Int. For
instance, you can create a Token using (TokIdent "x"), etc.
I'll explain the deriving clause in more detail when we talk about type classes.
For now it will suffice to know that deriving Show means that there is a way to
convert any Token to a string (either by calling show or by printing it),
and deriving Eq means that we can compare Tokens for (in-)equality. The
compiler is clever enough to implement this functionality all by itself (if it
can't, it will issue an error).
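Here is a small sketch of what the two derived instances give you. It repeats the Operator and Token definitions from the text so that it can be run on its own; the particular values are arbitrary:

```haskell
data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)

data Token = TokOp Operator
           | TokIdent String
           | TokNum Int
    deriving (Show, Eq)

main :: IO ()
main = do
    print (TokOp Plus)                    -- uses the derived Show instance
    putStrLn (show (TokNum 42))           -- show returns a String
    print (TokIdent "x" == TokIdent "x")  -- True, via the derived Eq instance
    print (TokNum 1 /= TokNum 2)          -- True
```

Note that Operator has to derive Show and Eq as well, otherwise the compiler can't generate the instances for Token.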
Pattern matching on these constructors is more interesting: We not only match
the constructor name but also the value with which it was originally called.
Here's a definition of a function showContent that uses this kind of pattern
matching:
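data Operator = Plus | Minus | Times | Div
    deriving (Show, Eq)
data Token = TokOp Operator
           | TokIdent String
           | TokNum Int
    deriving (Show, Eq)
-- helper (here called opToStr) that turns an operator into its symbol
opToStr :: Operator -> String
opToStr Plus = "+"
opToStr Minus = "-"
opToStr Times = "*"
opToStr Div = "/"
-- each pattern names the value stored inside the constructor
showContent :: Token -> String
showContent (TokOp op) = opToStr op
showContent (TokIdent str) = str
showContent (TokNum i) = show i
token :: Token
token = TokIdent "x"
main = do
    putStrLn $ showContent token
    print token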
In general, constructors may take many arguments of various types, and they
can all be matched by patterns.
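For example, here is a sketch with a constructor that takes two arguments. Shape, Circle, Rect, and area are hypothetical names chosen for this illustration only:

```haskell
-- Shape, Circle, Rect and area are illustrative names, not part of the tokenizer.
data Shape = Circle Double        -- radius
           | Rect Double Double   -- width and height

area :: Shape -> Double
area (Circle r) = pi * r * r   -- the pattern binds the single argument
area (Rect w h) = w * h        -- the pattern binds both arguments

main :: IO ()
main = do
    print (area (Rect 2 3))   -- 6.0
    print (area (Circle 1))
```

Each pattern lists one variable per constructor argument, in the order in which the arguments appear in the data definition.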
Ex 2. Define a data type Point with one constructor Pt that takes two Doubles,
corresponding to the x and y coordinates of a point. Write a function inc that
takes a Point and returns a new Point whose coordinates are one more than the
original coordinates. Use pattern matching.
data Point = Pt ...
    deriving Show
inc :: Point -> Point
inc ... = ...
p :: Point
p = Pt (-1) 3
main = print $ inc p
By the way, we've seen pattern matching previously applied to pairs. The
constructor of a pair is (,).
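As a reminder of how that looks, here is a minimal sketch. The function swapPair is a hypothetical helper written for this illustration:

```haskell
-- swapPair is an illustrative helper, not part of the tokenizer.
swapPair :: (Int, Char) -> (Char, Int)
swapPair (x, y) = (y, x)   -- the pattern (x, y) takes the pair apart

main :: IO ()
main = do
    print (swapPair (1, 'a'))   -- ('a',1)
    print ((,) True "yes")      -- (True,"yes"): the constructor in prefix form
```

The second line of main builds a pair by applying (,) like any other two-argument constructor.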
Ex 3. Solve the previous exercise using pairs rather than Points.
inc :: (Int, Int) -> (Int, Int)
inc ... = ...
p :: (Int, Int)
p = (-1, 3)
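A list of integers can be defined as a recursive data type: it is either Empty
or a Cons of an Int and another list:
data List = Cons Int List | Empty
The fact that this definition is recursive shouldn't bother us in the least. The
important thing is that it lets us create arbitrary lists:
lst0, lst1, lst2 :: List
lst0 = Empty        -- empty list
lst1 = Cons 1 lst0  -- one-element list
lst2 = Cons 2 lst1  -- two-element list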
This definition can also be used in pattern matching. For instance, here's a
function that checks if a list is a singleton:
singleton :: List -> Bool
singleton (Cons _ Empty) = True
singleton _ = False
main = do
    print $ singleton Empty
    print $ singleton $ Cons 2 Empty
    print $ singleton $ Cons 3 $ Cons 4 Empty
In this example, I made use of a wildcard pattern _. Let me remind you that this
pattern matches anything (without evaluating it). For instance, in the first
clause of singleton I'm discarding the integer stored in the list. In the second
clause I'm ignoring the whole list, because I know that the first clause, which
catches one-element lists, is tried first.
Most importantly, because list is defined recursively, it's easy to implement
recursive algorithms for it. For instance, to calculate the sum of all list elements
it's enough to say that the sum is equal to the first element plus the sum of the
rest. And, of course, the sum of an empty list is zero. So here we go:
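data List = Cons Int List | Empty
sumLst :: List -> Int
sumLst (Cons i rest) = i + sumLst rest
sumLst Empty = 0
lst = Cons 2 (Cons 4 (Cons 6 Empty))
main = do
    print (sumLst lst)
    print (sumLst Empty)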
But you don't want to be defining a new list type for each possible element
type. Fortunately, static polymorphism in Haskell is embarrassingly easy. No
need for the verbose template<typename T> ugliness. You just parameterize types
by specifying a type argument. You may define a generic list by
replacing Int with a type parameter a (type parameters must start with a
lowercase letter and are typically taken from the beginning of the alphabet):
data List a = Cons a (List a) | Empty
List a in this definition is a generic type; List itself is called a type constructor,
because you can use it to construct a new type by providing a type argument,
as in List Int, or List (List Char) (a list of lists of characters). To avoid
confusion, the constructors on the right hand side of a data definition are often
called data constructors, as opposed to the type constructor on the left.
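To make this concrete, here is a sketch of a function that works for any element type. The name lengthLst and the sample lists are assumptions made for this illustration:

```haskell
data List a = Cons a (List a) | Empty

-- lengthLst is an illustrative helper; it never inspects the elements,
-- so it works for any element type a.
lengthLst :: List a -> Int
lengthLst Empty = 0
lengthLst (Cons _ rest) = 1 + lengthLst rest

ints :: List Int
ints = Cons 1 (Cons 2 (Cons 3 Empty))

chars :: List Char
chars = Cons 'a' (Cons 'b' Empty)

main :: IO ()
main = do
    print (lengthLst ints)    -- 3
    print (lengthLst chars)   -- 2
```

The same lengthLst is applied to a List Int and to a List Char; the type argument is inferred at each call.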
In reality, you don't need to define a list type -- its definition is built into the
language, and its syntax is very convenient. The type name for a list consists
of a pair of square brackets with the type variable between them; Cons is
replaced by an infix colon, :; and the Empty list is an empty pair of square
brackets, []. You may think of the built-in list type as defined by this equation:
data [a] = a : [a] | []
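With the built-in list type, sumLst becomes:
sumLst :: [Int] -> Int
sumLst (i : rest) = i + sumLst rest
sumLst [] = 0
lst = [2, 4, 6]
main = do
    print (sumLst lst)
    print (sumLst [])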
There is another convenient feature: special syntax for list literals. Instead of
writing a series of constructors, 2:8:64:[], you can write [2, 8, 64].
Pattern matching may be nested. For instance, you may match the first three
elements of a list with the pattern (a : (b : (c : rest))) or, taking advantage
of the right associativity of :, simply (a : b : c : rest).
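Here is a minimal sketch of such a nested pattern in use. The function sumFirstThree is a hypothetical example, and returning 0 for shorter lists is just one possible choice:

```haskell
-- sumFirstThree is an illustrative function, not part of the tokenizer.
sumFirstThree :: [Int] -> Int
sumFirstThree (a : b : c : _) = a + b + c   -- list has at least three elements
sumFirstThree _ = 0                         -- shorter lists

main :: IO ()
main = do
    print (sumFirstThree [2, 8, 64, 100])   -- 74
    print (sumFirstThree [2, 8])            -- 0
```

The first pattern fails as soon as the list runs out before c is bound, so the second equation handles lists of length zero, one, or two.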
Finally, this is the definition of String:
type String = [Char]
String comes with some syntactic sugar of its own: When defining string
literals, you can write "Hello" instead of the more verbose
['H', 'e', 'l', 'l', 'o'].
In the definition of String, the type keyword introduces a type synonym (like typedef in C). You can
always go back and treat a String as a list of Char -- in particular, you may
pattern match it like a list. We'll be doing a lot of this in the implementation
of tokenize. Type synonyms increase the readability of code and lead to better
error messages, but they don't create new types.
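As a small taste of pattern matching on a String, here is a sketch that counts the spaces in a string. The function countSpaces is a hypothetical helper and is not the tokenizer itself:

```haskell
-- countSpaces is an illustrative helper, not part of the tokenizer.
countSpaces :: String -> Int
countSpaces [] = 0                                -- empty string
countSpaces (' ' : rest) = 1 + countSpaces rest   -- first character is a space
countSpaces (_ : rest) = countSpaces rest         -- any other character

main :: IO ()
main = do
    print (countSpaces "x + 12")   -- 2
    print (countSpaces "")         -- 0
```

The patterns are ordinary list patterns; a character literal such as ' ' in a pattern matches only that character.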
In the next tutorial we'll continue to work on the tokenizer and learn about
guards and touch upon currying.
Exercises
Ex 4. Implement norm that takes a list of Doubles and returns the square root
(sqrt) of the sum of squares of its elements.
norm :: [Double] -> Double
norm lst = undefined
Ex 5. Implement the function decimate that skips every other element of a list.
decimate :: [a] -> [a]
decimate = undefined
Ex 6. Implement a function that takes a pair of lists and returns a list of pairs.
For instance, ([1, 2, 3, 4], [1, 4, 9]) should produce [(1, 1), (2, 4), (3, 9)].
Notice that the longer of the two lists is truncated if necessary. Use nested
patterns.
zipLst :: ([a], [b]) -> [(a, b)]
zipLst = undefined
Incidentally, there is a two-argument function zip in the Prelude that does the
same thing:
main = print $ zip [1, 2, 3, 4] "Hello"