SE Compiler Chapter 5-Type Checking and Symbol Table
Type systems are the biggest point of variation across programming languages. Even languages that look
similar are often greatly different when it comes to their type systems.
Definition: A type system is a set of types and type constructors (integers, arrays, classes, etc.) along with the
rules that govern whether or not a program is legal with respect to types (i.e., type checking).
For example, C++ and Java have similar syntax and control structures. They even have a similar set of
types (classes, arrays, etc.). However, they differ greatly with respect to the rules that determine whether or
not a program is legal with respect to types. As an extreme example, one can do this in C++ but not in Java:
In other words, Java’s type rules do not allow the above statement but C++’s type rules do allow it.
Why do different languages use different type systems? The reason for this is that there is no one perfect type
system. Each type system has its strengths and weaknesses. Thus, different languages use different type
systems because they have different priorities. A language designed for writing operating systems is not
appropriate for programming the web; thus they will use different type systems. When designing a type
system for a language, the language designer needs to balance the tradeoffs between execution efficiency,
expressiveness, safety, simplicity, etc.
The ability to grab a random word from memory and interpret it as if it were a specific kind of value (e.g.,
integer) is not useful most of the time even though it may be useful sometimes (e.g., when writing low-level
code such as device drivers). Thus even assembly language programmers use conventions and practices to
organize their program’s memory. For example, a particular block of memory may (by convention) be
treated as an array of integers and another block a sequence of instructions. In other words, even in a type-
less world, programmers try to impose some type structure upon their data. However, in a type-less world,
there is no automatic mechanism for enforcing the types. In other words, a programmer could mistakenly
write and run code that makes no sense with respect to types (e.g., adding 1 to an instruction). This mistake
may manifest itself as a crash when the program is run or may silently corrupt values and thus the program
produces an incorrect answer. To avoid these kinds of problems, most languages today have a rich type
system. Such a type system has types and rules that try to prevent such problems. Some researchers have also
proposed typed assembly languages recently; these would impose some kinds of type checking on assembly
language programs!
The rules in a given type system allow some programs and disallow others. For example, unlike
assembly language, one cannot write a program in Java that grabs a random word from memory and treats it
A boolean type is the set that contains True and False. An integer type is the set that contains all integers
from minint to maxint (where minint and maxint may be defined by the type system itself, as in Java, or by
the underlying hardware, as in C). A float type contains all floating point values. An integer array of length
2 is the set that contains all length-2 sequences of integers. A string type is the set that contains all strings.
For some types the set may be enumerable (i.e., we can write down all the members of the set, such as True
and False). For other types the set may not be enumerable, or may at least be very hard to enumerate (e.g.,
String).
When we declare a variable to be of a particular type, we are effectively restricting the set of values that the
variable can hold. For example, if we declare a variable to be of type Boolean, (i) we can only put True or
False into the variable; and (ii) when we read the contents of the variable it must be either True or False.
· Base types: int, float, double, char, bool, etc. These are the primitive types provided directly by
the underlying hardware. There may be a facility for user-defined variants on the base types (such as
C enums).
· Compound types: arrays, pointers, records, structs, unions, classes, and so on. These types are
constructed as aggregations of the base types and simple compound types.
· Complex types: lists, stacks, queues, trees, heaps, tables, etc. You may recognize these as abstract data
types. A language may or may not have support for these sorts of higher-level abstractions.
Function declarations or prototypes serve the same purpose for functions that variable declarations serve for
variables. Function and method identifiers also have a type, and the compiler can use it to ensure that a program is
calling a function/method correctly. The compiler uses the prototype to check the number and types of
arguments in function calls. The location and qualifiers establish the visibility of the function (Is the function
global? Local to the module? Nested in another procedure? Attached to a class?) Type declarations (e.g., C
typedef, C++ classes) have similar behaviors with respect to declaration and use of the new typename.
Now that we have defined what a type is, we can define what it means to do type checking.
Definition: Type checking checks and enforces the rules of the type system to prevent type errors
from happening. (OR Type checking is the process of verifying that each operation executed in a
program respects the type system of the language.)
Definition: A type error happens when an expression produces a value outside the set of values it is
supposed to have.
Since s has type String, it can only hold values of type String. Thus, the expression on the right-hand-side of
the assignment must be of type String. However, the expression evaluates to the value 1234 which does not
belong to the set of values in String. Thus the above is a type error.
Since b has type Byte, it can only hold values from -128 to 127. Since the expression on the right-hand-side
evaluates to a value that does not belong to the set of values in Byte, this is also a type error.
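The view of a type as a set of values can be sketched in a few lines of Python. This is only an illustration: the names `TYPE_SETS` and `is_type_error` are hypothetical, the Byte range is the one stated above, and modelling String as "any Python str" is an assumption made for the sketch.

```python
# Each type is modelled as a membership test over values
# (a sketch of the "type = set of values" view, not a real type checker).
TYPE_SETS = {
    "Boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "Byte": lambda v: isinstance(v, int) and not isinstance(v, bool) and -128 <= v <= 127,
    "String": lambda v: isinstance(v, str),
}

def is_type_error(declared_type, value):
    """Return True when value lies outside the set denoted by declared_type."""
    return not TYPE_SETS[declared_type](value)

print(is_type_error("String", 1234))  # True: 1234 is not in the set String
print(is_type_error("Byte", 127))     # False: 127 is in the set Byte
print(is_type_error("Byte", 200))     # True: 200 is outside -128..127
```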
Definition: Strong type checking prevents all type errors from happening. The checking may happen
at compile time or at run time or partly at compile time and partly at run time.
Definition: A strongly-typed language is one that uses strong type checking.
Definition: Weak type checking does not prevent type errors from happening.
Definition: A weakly-typed language is one that uses weak type checking.
Static type checking is done at compile-time. The information the type checker needs is obtained via
declarations and stored in a master symbol table. After this information is collected, the types involved in
each operation are checked. It is very difficult for a language that only does static type checking to meet the
full definition of strongly typed. Even motherly old Pascal, which would appear to be so because of its use of
declarations and strict type rules, cannot find every type error at compile time. This is because many type
errors can sneak through the type checker. For example, if a and b are of type int and we assign very large
values to them, a * b may not be in the acceptable range of ints, or an attempt to compute the ratio
between two integers may raise a division by zero. These kinds of type errors usually cannot be detected at
compile time. C makes a somewhat paltry attempt at strong type checking: things such as the lack of array
bounds checking and no enforcement of variable initialization or function return create loopholes. The typecast
operation is particularly dangerous. By taking the address of a location, casting to something inappropriate,
dereferencing and assigning, you can wreak havoc on the type rules. The typecast basically suspends type
checking, which, in general, is a pretty risky thing to do.
Dynamic type checking is implemented by including type information for each data location at runtime. For
example, a variable of type double would contain both the actual double value and some kind of tag
indicating "double type". The execution of any operation begins by first checking these type tags. The
operation is performed only if everything checks out. Otherwise, a type error occurs and usually halts
execution. For example, when an add operation is invoked, it first examines the type tags of the two operands
to ensure they are compatible. LISP is an example of a language that relies on dynamic type checking.
Because LISP does not require the programmer to state the types of variables at compile time, the compiler
cannot perform any analysis to determine if the type system is being violated. But the runtime type system
takes over during execution and ensures that type integrity is maintained. Dynamic type checking clearly
comes with a runtime performance penalty, but it is usually much more difficult to subvert and can report errors
that are not possible to detect at compile-time.
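The tag-checking idea can be sketched as follows. This is a minimal illustration, not how any particular LISP runtime works: the `Tagged` class, the `DynamicTypeError` exception, and the rule that both operands must carry the same numeric tag are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    tag: str       # runtime type tag, e.g. "int" or "double"
    value: object  # the actual value stored at this location

class DynamicTypeError(Exception):
    """Raised when an operation's operands fail the runtime tag check."""

def add(a, b):
    # The operation begins by examining the type tags of both operands.
    if a.tag != b.tag or a.tag not in ("int", "double"):
        raise DynamicTypeError(f"cannot add {a.tag} and {b.tag}")
    # Only if the tags check out is the addition actually performed.
    return Tagged(a.tag, a.value + b.value)

result = add(Tagged("double", 1.5), Tagged("double", 2.0))
print(result.tag, result.value)  # double 3.5

try:
    add(Tagged("int", 1), Tagged("string", "x"))
except DynamicTypeError as err:
    print("type error:", err)
```

The tag check on every operation is where the runtime cost mentioned above comes from.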
Here, constants represented by the tokens literal and num have types char and integer, respectively.
Lookup(e) is used to fetch the type saved in the symbol table entry pointed to by e.
c. E → E1 mod E2 { E.type = if E1.type = integer and E2.type = integer then integer else type_error }
The expression formed by applying the mod operator to two sub-expressions of type integer has
type integer; otherwise, its type is type_error.
d. E → E1[E2] { E.type = if E2.type = integer and E1.type = array(s,t) then t else type_error }
In an array reference E1[E2], the index expression E2 must have type integer. The result is the
element type t obtained from the type array(s,t) of E1.
The postfix operator ↑ yields the object pointed to by its operand. The type of E ↑ is the type t of
the object pointed to by the pointer E.
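The rules for expressions can be sketched as a small recursive function. This is an illustrative sketch under several assumptions: expressions are encoded as nested tuples, array and pointer types are encoded as `("array", s, t)` and `("pointer", t)`, the symbol table is a plain dictionary, and an undeclared identifier yields `type_error`. The name `type_of` is hypothetical.

```python
TYPE_ERROR = "type_error"

def type_of(expr, symtab):
    """Compute E.type for a tuple-encoded expression."""
    kind = expr[0]
    if kind == "literal":            # character constant
        return "char"
    if kind == "num":                # integer constant
        return "integer"
    if kind == "id":                 # fetch the type saved in the symbol table
        return symtab.get(expr[1], TYPE_ERROR)
    if kind == "mod":                # E -> E1 mod E2
        t1, t2 = type_of(expr[1], symtab), type_of(expr[2], symtab)
        return "integer" if t1 == "integer" and t2 == "integer" else TYPE_ERROR
    if kind == "index":              # E -> E1[E2]
        t1, t2 = type_of(expr[1], symtab), type_of(expr[2], symtab)
        if t2 == "integer" and isinstance(t1, tuple) and t1[0] == "array":
            return t1[2]             # element type t of array(s, t)
        return TYPE_ERROR
    if kind == "deref":              # postfix pointer dereference
        t = type_of(expr[1], symtab)
        if isinstance(t, tuple) and t[0] == "pointer":
            return t[1]              # type of the object pointed to
        return TYPE_ERROR
    return TYPE_ERROR

symtab = {"a": ("array", 10, "char"), "p": ("pointer", "integer"), "i": "integer"}
print(type_of(("index", ("id", "a"), ("mod", ("id", "i"), ("num",))), symtab))  # char
print(type_of(("deref", ("id", "p")), symtab))                                  # integer
print(type_of(("mod", ("literal",), ("num",)), symtab))                         # type_error
```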
a. Assignment Statement
b. Conditional Statements
c. While Statements
Example: Consider the following SDT and draw the annotated parse tree for the input (2 + 3) == 8
E → E1 + E2 {if ((E1.type = E2.type) and (E1.type = int)) then E.type = int else error;}
E → E1 == E2 {if ((E1.type = E2.type) and (E1.type = int or E1.type = bool)) then E.type = bool else error;}
E → (E1) {E.type = E1.type;}
E → num {E.type = int;}
E → true {E.type = bool;}
E → false {E.type = bool;}
Solution:
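E (E.type = bool)
├── E (E.type = int)
│   ├── (
│   ├── E (E.type = int)
│   │   ├── E (E.type = int)
│   │   │   └── num = 2
│   │   ├── +
│   │   └── E (E.type = int)
│   │       └── num = 3
│   └── )
├── ==
└── E (E.type = int)
    └── num = 8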
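The same SDT can be sketched as a bottom-up evaluation over a tree. This is an illustration only: the tuple encoding of the parse tree and the function name `check` are assumptions, and the SDT's "error" action is modelled by raising Python's built-in `TypeError`.

```python
def check(node):
    """Return E.type for a tuple-encoded expression, following the SDT rules."""
    kind = node[0]
    if kind == "num":                      # E -> num
        return "int"
    if kind in ("true", "false"):          # E -> true | false
        return "bool"
    if kind == "paren":                    # E -> (E1)
        return check(node[1])
    if kind == "add":                      # E -> E1 + E2
        t1, t2 = check(node[1]), check(node[2])
        if t1 == t2 == "int":
            return "int"
        raise TypeError("operands of + must both be int")
    if kind == "eq":                       # E -> E1 == E2
        t1, t2 = check(node[1]), check(node[2])
        if t1 == t2 and t1 in ("int", "bool"):
            return "bool"
        raise TypeError("operands of == must have the same type (int or bool)")
    raise ValueError(f"unknown node kind: {kind}")

# Parse tree for the input (2 + 3) == 8
tree = ("eq", ("paren", ("add", ("num", 2), ("num", 3))), ("num", 8))
print(check(tree))  # bool
```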
A symbol table is an important data structure created and maintained by compilers in order to store information
about the occurrence of various entities such as variable names, function names, objects, classes, interfaces,
etc. The symbol table is used by both the analysis and the synthesis parts of a compiler.
A symbol table may serve the following purposes, depending upon the language at hand:
· To store the names of all entities in a structured form at one place.
· To verify if a variable has been declared.
· To implement type checking, by verifying that assignments and expressions in the source code are
semantically correct.
· To determine the scope of a name (scope resolution).
An attribute of a symbol in the source code is a piece of information associated with that symbol, such as
its value, state, scope, and type. The insert() function takes the symbol and its
attributes as arguments and stores the information in the symbol table.
For example:
int a;
insert(a, int);
lookup()
The lookup() operation is used to search for a name in the symbol table to determine:
· if the symbol exists in the table.
· if it is declared before it is used.
· if the name is used in the scope.
· if the symbol is initialized.
· if the symbol is declared multiple times.
The format of the lookup() function varies according to the programming language. The basic format should
match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol exists in the
symbol table, it returns its attributes stored in the table.
delete(s):
deletes s from the table (or, typically, hides it).
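The three operations can be sketched together as a small class. This is a minimal sketch under stated assumptions: the class name `SymbolTable` is hypothetical, a Python dictionary stands in for the underlying storage, `lookup` returns `None` where the text above says 0, and `delete` hides the entry rather than removing it.

```python
class SymbolTable:
    def __init__(self):
        self._entries = {}    # name -> dictionary of attributes
        self._hidden = set()  # names that have been deleted (hidden)

    def insert(self, name, **attrs):
        """Store a symbol and its attributes; reject a visible duplicate."""
        if name in self._entries and name not in self._hidden:
            raise KeyError(f"{name!r} is already declared")
        self._entries[name] = dict(attrs)
        self._hidden.discard(name)

    def lookup(self, name):
        """Return the symbol's attributes, or None if it is absent or hidden."""
        if name in self._hidden:
            return None
        return self._entries.get(name)

    def delete(self, name):
        """Hide the symbol so that later lookups no longer find it."""
        if name in self._entries:
            self._hidden.add(name)

table = SymbolTable()
table.insert("a", type="int")   # corresponds to: int a;  insert(a, int);
print(table.lookup("a"))        # {'type': 'int'}
print(table.lookup("b"))        # None
table.delete("a")
print(table.lookup("a"))        # None
```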
· Each entry in a symbol table can be implemented as a record that consists of several fields.
· The entries in symbol table records are not uniform and depend on the program element identified
by the name.
· Some information about the name may be kept outside of the symbol table record and/or some fields
of the record may be left vacant for the reason of uniformity. A pointer to this information may be
stored in the record.
· The name may be stored in the symbol table record itself, or it can be stored in a separate array of
characters, with a pointer to it kept in the symbol table record.
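The last point, keeping names in a separate character array, can be sketched as follows. The names `NamePool` and `Record` are hypothetical, and representing the "pointer" as a (start, length) pair and the out-of-record information as a dictionary are assumptions made for the illustration.

```python
from dataclasses import dataclass, field

class NamePool:
    """A single array of characters holding every name back to back."""
    def __init__(self):
        self.chars = []

    def add(self, name):
        start = len(self.chars)
        self.chars.extend(name)
        return (start, len(name))   # plays the role of the pointer

    def get(self, ref):
        start, length = ref
        return "".join(self.chars[start:start + length])

@dataclass
class Record:
    name_ref: tuple                 # (start, length) into the name pool
    type: str
    extra: dict = field(default_factory=dict)  # information kept outside the fixed fields

pool = NamePool()
rec_a = Record(pool.add("count"), "int")
rec_b = Record(pool.add("total"), "float", {"scope": "global"})
print(pool.get(rec_a.name_ref))  # count
print(pool.get(rec_b.name_ref))  # total
```

Storing only a fixed-size reference in each record keeps the records uniform in size even when names vary in length.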
a. Linear List
Advantage: It takes less space, and additions to the table are simple.
Disadvantage: It has a higher accessing time.
c. Hash Table