UNIT-1: Overview of a Language-Processing System
Objective:
To familiarize students with the lexical analyzer.
Syllabus:
Lexical analysis: Overview of language processing, preprocessors, compiler,
assembler, interpreters, linkers & loaders, phases of a compiler. Lexical analysis:
role of lexical analysis, lexical analysis vs parsing, token, patterns and lexemes,
lexical errors, transition diagram for recognition of tokens, reserved words and
identifiers.
Learning Outcomes:
Students will be able to
enumerate the components of a language-processing system.
identify the differences between a compiler and an interpreter.
design a lexical analyzer for a given language.
Learning Material
Overview of A language-processing System
Pre-processor:
A pre-processor is a program that processes its input data to produce output that is
used as input to another program.
[Figure: Skeletal source program -> Preprocessor -> Source program]
Functions of pre-processor:
1. Macro processing: A pre-processor may allow a user to define macros that
are shorthand for longer constructs.
2. File inclusion: A pre-processor may include header files into the program
text.
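#include <stdio.h>
/* Macro definitions are not in the original listing; assumed from the output below. */
#define height 100
#define number 3.14
#define letter 'A'
#define letter_sequence "ABC"
#define backslash_char '\?'
int main(void)
{
    printf("value of height : %d \n", height );
    printf("value of number : %f \n", number );
    printf("value of letter : %c \n", letter );
    printf("value of letter_sequence : %s \n", letter_sequence);
    printf("value of backslash_char : %c \n", backslash_char);
    return 0;
}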
Output:
value of height : 100
value of number : 3.140000
value of letter : A
value of letter_sequence : ABC
value of backslash_char : ?
Compiler:
A compiler is a computer program that reads a program written in one
language -the source language-and translates it into an equivalent program
in another language.
An important role of a compiler is to report errors in the source program to the user.
DEPARTMENT OF INFORMATION TECHNOLOGY GEC Page 2
COMPILER DESIGN UNIT-1
Assembler
Interpreter:
An interpreter is another common kind of language processor. Instead of
producing a target program as a translation, it appears to directly execute
the operations specified in the source program on inputs supplied by the user.
Compiler | Interpreter
The compiler scans the entire program first and then translates it into equivalent machine code. | The interpreter scans the program line by line.
Compiled programs take more memory because the entire program has to reside in memory. | Interpreted programs take less memory because only one line of code resides in memory at a time.
A compiled language is more difficult to debug. | Debugging is easy because the interpreter stops and reports errors as it encounters them.
Linker:
A linker or link editor is a computer program that takes one or more object
files generated by a compiler and combines them into a single executable
file, library file, or another object file.
The linker resolves external references, where the code in one file
refers to code or data defined in another file.
The compiler usually invokes the linker automatically as the last step in
compiling a program. The linker inserts
code (or maps in shared libraries) to resolve program library references,
and/or combines object modules into an executable image suitable for
loading into memory.
Static linking is the result of the linker copying all library routines used in
the program into the executable image. This may require more disk space
and memory, but is both faster and more portable, since it does not require
the presence of the library on the system where it is run.
Dynamic linking is accomplished by placing the name of a sharable library
in the executable image. Actual linking with the library routines does not
occur until the image is run, when both the executable and the library are
placed in memory. An advantage of dynamic linking is that multiple
programs can share a single copy of the library.
If the linker cannot find the definition of a referenced function in any object
file or library, it reports an unresolved-reference error and does not produce
an executable.
Usually a longer program is divided into smaller subprograms called
modules, and these modules must be combined to execute the program.
The process of combining the modules is done by the linker.
The linker converts relocatable object code into an executable format that the
operating system can load.
Loader
The loader places the executable object files into memory for
execution.
Relocating loaders
Some operating systems need relocating loaders, which adjust
addresses (pointers) in the executable to compensate for variations in
the address at which loading starts.
The operating systems that need relocating loaders are those in which
a program is not always loaded into the same location in the address
space and in which pointers are absolute addresses rather than offsets
from the program's base address.
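The address adjustment described above can be sketched as follows. This is a minimal illustration and not the format of any real loader: the `Image` structure, its relocation table of word indices, and the function name `relocate` are all assumptions made for the example. Each word listed in the table is assumed to hold an absolute address computed as if the program were loaded at address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical program image: an array of words plus a relocation table
   listing which words hold absolute addresses (assumed layout). */
typedef struct {
    uint32_t     *words;        /* program image, one word per slot       */
    size_t        word_count;   /* number of words in the image           */
    const size_t *reloc;        /* indices of the words that are pointers */
    size_t        reloc_count;  /* number of relocation entries           */
} Image;

/* Add the actual load address to every word named in the relocation table.
   Words that are not listed (instructions, plain constants) are untouched. */
void relocate(Image *img, uint32_t load_base)
{
    assert(img != NULL);
    for (size_t i = 0; i < img->reloc_count; i++) {
        size_t slot = img->reloc[i];
        assert(slot < img->word_count);
        img->words[slot] += load_base;
    }
}
```

If the image is loaded at 0x4000, a word that held the address 0x0004 becomes 0x4004, while words that are not in the relocation table keep their values.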
Linking and loading provide four functions:
1. Allocation
2. Relocation
3. Linking
4. Loading
For example, the symbol table entries for the statement below are shown
below.
Error handler:
Each phase can encounter errors.
After detecting an error, a phase must somehow deal with that error, so that
compilation can proceed, allowing further errors in the source program to
be detected.
The lexical analysis phase detects errors where the characters remaining in
the input do not form any token of the language.
The syntax analysis phase detects token streams that violate the structure
(or syntax) rules of the language.
The semantic analysis phase detects constructs that have the right syntactic
structure but no meaning for the operation involved.
Lexical analysis
Lexical analysis is the first phase of a compiler.
Lexical analyzer is also called Scanner.
The lexical analysis phase reads the characters from the source program and
groups them into a stream of tokens in which each token represents a logically
cohesive sequence of characters, such as an identifier, a keyword (if, while,
etc.), a punctuation character, etc.
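For example, the characters in the statement position := initial + rate * 60 would be
grouped into the following tokens:
The identifier 1 - position.
The assignment symbol - :=.
The identifier 2 - initial.
The plus sign - +.
The identifier 3 - rate.
The multiplication sign - *.
The number - 60.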
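The grouping of characters into tokens can be sketched with a small hand-written scanner. This is only an illustration under stated assumptions: the names `TokenKind`, `Token`, `next_token` and `count_tokens` are hypothetical, and the scanner knows just the token classes that occur in the example statement (identifier, number, :=, + and *).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_STAR,
               TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind   kind;
    const char *start;    /* first character of the lexeme      */
    size_t      length;   /* number of characters in the lexeme */
} Token;

/* Scan one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;

    while (isspace((unsigned char)*p))          /* skip white space */
        p++;

    Token tok = { TOK_ERROR, p, 1 };            /* default: one bad character */
    if (*p == '\0') {
        tok.kind = TOK_EOF;
        tok.length = 0;
    } else if (isalpha((unsigned char)*p)) {    /* letter (letter | digit)* */
        while (isalnum((unsigned char)p[tok.length]))
            tok.length++;
        tok.kind = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {    /* one or more digits */
        while (isdigit((unsigned char)p[tok.length]))
            tok.length++;
        tok.kind = TOK_NUM;
    } else if (p[0] == ':' && p[1] == '=') {
        tok.kind = TOK_ASSIGN;
        tok.length = 2;
    } else if (*p == '+') {
        tok.kind = TOK_PLUS;
    } else if (*p == '*') {
        tok.kind = TOK_STAR;
    }
    *cursor = p + tok.length;
    return tok;
}

/* Count tokens up to the end of input or the first unrecognized character. */
size_t count_tokens(const char *src)
{
    size_t n = 0;
    for (;;) {
        Token t = next_token(&src);
        if (t.kind == TOK_EOF || t.kind == TOK_ERROR)
            return n;
        n++;
    }
}
```

Under these assumptions, `count_tokens("position := initial + rate * 60")` yields 7: three identifiers, the assignment symbol, the plus sign, the multiplication sign and the number.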
Semantic analysis
This phase checks the source program for semantic errors and gathers type
information for the subsequent code-generation phase.
It uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking.
Syntax trees after semantic analysis phase for the example statement
position := initial + rate * 60
Code optimization
Code optimization phase attempts to improve the intermediate code, so that
faster-running machine code will result.
Optimized Three address code after Code Optimization phase for the
example statement
position := initial + rate * 60
Code generation
The final phase of the compiler is the generation of target code, consisting
of relocatable machine code or assembly code.
Memory locations are selected for each of the variables used by the
program.
Then, each intermediate instruction is translated into a sequence of machine
instructions that perform the same task.
A crucial aspect is the assignment of variables to registers.
Example:
Pass
A pass groups several phases of compilation and carries them out in one
traversal of the input.
Phase
A phase is a logical unit of the compiler that performs one particular task.
Pass | Phase
A pass requires more space. | A phase requires less space.
A single pass takes more time for execution. | A single phase takes more time for execution.
Another task of the lexical analyzer is stripping out comments and white space
(blank, tab and newline characters) from the source program.
A further task is correlating error messages from the compiler with the source program.
The lexical analyzer may keep track of the number of newline characters
seen, so that a line number can be associated with an error message.
In some compilers, the lexical analyzer is in charge of making a copy of the
source program with the error messages marked in it.
If the lexical analyzer finds an invalid token, it generates an error.
The lexical analyzer works closely with the syntax analyzer. It reads
character streams from the source code, checks for legal tokens, and passes
the data to the syntax analyzer when it demands.
The lexical analyzer collects information about tokens into their associated
attributes.
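Stripping white space and comments while counting newlines can be sketched as below. The function name `skip_blanks_and_comments` is hypothetical, and the sketch assumes C-style block comments only; an unterminated comment is simply consumed to the end of the input rather than reported.

```c
#include <assert.h>
#include <stddef.h>

/* Skip blanks, tabs, newlines and C-style block comments starting at src.
   Every newline crossed increments *line, so a later error message can
   carry a line number. Returns a pointer to the first character that
   belongs to a token, or to the terminating '\0'. */
const char *skip_blanks_and_comments(const char *src, int *line)
{
    assert(src != NULL && line != NULL);
    for (;;) {
        if (*src == ' ' || *src == '\t') {
            src++;
        } else if (*src == '\n') {
            (*line)++;
            src++;
        } else if (src[0] == '/' && src[1] == '*') {
            src += 2;                          /* step over the opening marker */
            while (*src != '\0' && !(src[0] == '*' && src[1] == '/')) {
                if (*src == '\n')
                    (*line)++;
                src++;
            }
            if (*src != '\0')
                src += 2;                      /* step over the closing marker */
        } else {
            return src;
        }
    }
}
```

A scanner would call such a routine before recognizing each token, so that the parser never sees white space or comments.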
Token:
Token is a sequence of characters that can be treated as a single logical
entity.
Typical tokens are:
1) Identifiers 2) keywords 3) operators 4) special symbols
5) Constants
Pattern:
A pattern is a rule that describes the set of strings that can form a token.
It is expressed as a regular expression that describes how a particular token can
be formed. For example, [A-Za-z][A-Za-z_0-9]*
Lexeme:
A lexeme is a sequence of characters in the source program that is matched
by the pattern for a token.
Each lexeme corresponds to a token.
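The relationship between a pattern and its lexemes can be checked with a regular-expression library. The sketch below assumes a POSIX system, since `<regex.h>` is not part of ISO C, and writes the identifier pattern above in POSIX extended syntax; the function name `is_identifier_lexeme` is hypothetical.

```c
#include <assert.h>
#include <regex.h>      /* POSIX regular expressions (assumed available) */
#include <stdbool.h>
#include <stddef.h>

/* Return true if the whole string is a lexeme matching the identifier
   pattern [A-Za-z][A-Za-z_0-9]*. */
bool is_identifier_lexeme(const char *lexeme)
{
    assert(lexeme != NULL);
    regex_t re;
    /* ^ and $ anchor the pattern so the entire lexeme must match,
       not just a prefix of it. */
    if (regcomp(&re, "^[A-Za-z][A-Za-z_0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool ok = (regexec(&re, lexeme, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Here rate and rate_2 are lexemes of the identifier token, while 60 and 2x are not, because they do not begin with a letter.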
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns.
Regular expressions
The languages accepted by finite automata are easily described by simple
expressions called regular expressions.
Let Σ be an alphabet. The regular expressions over Σ and the sets that they denote
are defined recursively as follows.
1) Ø is a regular expression and denotes the empty set.
2) ε is a regular expression and denotes the set { ε }.
3) For each a in Σ, a is a regular expression and denotes the set {a}.
4) If r and s are regular expressions denoting the languages R and S, respectively,
then (r + s), (rs), and (r*) are regular expressions that denote the sets R ∪ S,
RS, and R*, respectively.
Example:
Regular expression for pascal identifier
letter (letter | digit)*
Regular Definition
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form
d1 -> r1
d2 -> r2
...
dn-> rn
where each di is a distinct name and each ri is a regular expression over the
symbols in
Σ ∪ {d1, d2, ..., di-1}, that is, the basic symbols and the previously defined names.
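Examples:
Regular definition for identifiers in PASCAL
letter -> A | B | ... | Z | a | b | ... | z
digit -> 0 | 1 | ... | 9
id -> letter (letter | digit)*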
Recognition of Tokens
This section considers how to take the patterns for all the needed tokens and
build a piece of code that examines the input string and finds a prefix that is
a lexeme matching one of the patterns.
Example:
[Transition diagram for the keyword then: start --t--> --h--> --e--> --n--> --nonletter/digit--> accept]
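A transition diagram of this kind maps directly onto code in which a variable holds the current state. The sketch below is an assumed encoding, not a prescribed one: states 0 to 3 each expect one letter of then, and state 4 accepts only if the next character is not a letter or digit (end of input counts as such a character). The function name `starts_with_keyword_then` is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the input begins with the keyword then, followed by a
   character that cannot extend it into a longer identifier. */
bool starts_with_keyword_then(const char *input)
{
    assert(input != NULL);
    static const char expected[] = "then";
    int state = 0;

    while (state < 4) {
        if (input[state] != expected[state])
            return false;            /* no edge leaves this state on this character */
        state++;
    }
    /* State 4: the final edge is labeled nonletter/digit. */
    return !isalnum((unsigned char)input[4]);
}
```

With this check, then x and then( are recognized as starting with the keyword, whereas thenext is not, because the scanner must treat it as an identifier.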
Lexical errors:
It is hard for a lexical analyzer to tell, without the aid of other components,
that there is a source-code error.
For instance, suppose the string fi is encountered for the first time in a C program
in the context:
fi ( a == f(x) ) ...
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if
or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer must return
the token id to the parser and let some other phase of the compiler
(probably the parser in this case) handle the error due to the transposition of
the letters.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
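The four transformations listed above can be illustrated by a routine that reports which single action, if any, turns a lexeme into a given target string such as a keyword. This is only a sketch for illustration: the names `Repair` and `single_repair` are hypothetical, and it is not a claim about how any particular compiler performs recovery.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { REPAIR_NONE, REPAIR_DELETE, REPAIR_INSERT,
               REPAIR_REPLACE, REPAIR_TRANSPOSE } Repair;

/* Report which single action turns `lexeme` into `target`, or REPAIR_NONE
   if the strings are equal or need more than one action. */
Repair single_repair(const char *lexeme, const char *target)
{
    assert(lexeme != NULL && target != NULL);
    size_t n = strlen(lexeme), m = strlen(target);
    size_t i = 0;

    while (i < n && i < m && lexeme[i] == target[i])    /* common prefix */
        i++;

    if (n == m + 1)     /* one character too many: delete lexeme[i] */
        return strcmp(lexeme + i + 1, target + i) == 0 ? REPAIR_DELETE : REPAIR_NONE;
    if (n + 1 == m)     /* one character missing: insert target[i] */
        return strcmp(lexeme + i, target + i + 1) == 0 ? REPAIR_INSERT : REPAIR_NONE;
    if (n != m || i == n)   /* lengths differ by more than one, or already equal */
        return REPAIR_NONE;
    if (strcmp(lexeme + i + 1, target + i + 1) == 0)    /* only position i differs */
        return REPAIR_REPLACE;
    if (i + 1 < n && lexeme[i] == target[i + 1] && lexeme[i + 1] == target[i]
            && strcmp(lexeme + i + 2, target + i + 2) == 0)
        return REPAIR_TRANSPOSE;
    return REPAIR_NONE;
}
```

For the example in the text, `single_repair("fi", "if")` reports a transposition of two adjacent characters.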
Assignment-Cum-Tutorial Questions
11. r+ represents _________________________.
I) Objective Questions
7. Assembly language __________ [ ]
a. is usually the primary user interface
b. requires fixed format commands
c. is a mnemonic form of machine language
d. is quite different from the SCL interpreter
A B C D
a) 4 3 2 1
b) 3 4 1 2
c) 4 3 1 2
d) 4 2 3 1
5. The lexical analysis for a modern computer language such as Java needs the
power of which one of the following machine models in a necessary and
sufficient sense? [ ]
a. Finite state automata b. Deterministic pushdown automata
c. Non-deterministic pushdown automata d. Turing machine
}
a.23 b. 20 c. 25 d. 19
7. Find the number of tokens in the following statement: [ ]
printf("i=%d,&i=%x",i,&i);
a. 19 b. 10 c. 22 d. 20
II)Problems
1. Write the output at all phases of a compiler for the following statement
x=a+b*c
2. Construct the transition diagram for identifiers in C.
3. Construct syntax tree for the expression a=b*-c+b*-c.
4. Identify the lexemes that make up the tokens in the following program
segment. Indicate corresponding token and pattern
void swap(int i, int j)
{
int t;
t=i;
i=j;
j=t;
}
5. Differentiate between pass and phase of a compiler.
6. Differentiate between Compiler and Interpreter.
7. Construct transition diagram for relational operators in C.
8. Give regular expression for unsigned numbers in C.