
Lexical Analysis

Chapter 3
Instructor: Mr. Faiz Rasool
Email: [email protected]
The Role of the Lexical Analyzer
• The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce as output a sequence of tokens for
each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• It is common for the lexical analyzer to interact with the symbol table as well.
• When the lexical analyzer discovers a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table.
• In some cases, information regarding the kind of identifier may be read from the
symbol table by the lexical analyzer to assist it in determining the proper token it
must pass to the parser.
• The interaction is implemented by having the parser call the lexical analyzer.
• The call, suggested by the getNextToken command, causes the lexical analyzer to
read characters from its input until it can identify the next lexeme and produce for
it the next token, which it returns to the parser.
[Figure: Interactions between the lexical analyzer and the parser]
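The calling convention can be pictured with a short C sketch. This is a minimal illustration only: the Token struct, the TokenName values, and the stub body of getNextToken are hypothetical names invented here, not code from the text.

#include <stdio.h>

/* Hypothetical token names; a real lexer defines many more. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_IF, TOK_EOF } TokenName;

/* A token is a pair: an abstract name plus an optional attribute,
   e.g., a symbol-table index for identifiers. */
typedef struct {
    TokenName name;
    int       attribute;
} Token;

/* Stub lexical analyzer: a real one would read characters until it
   recognizes the next lexeme. Here it reports end of input at once. */
Token getNextToken(void) {
    Token t = { TOK_EOF, 0 };
    return t;
}

/* The parser drives the interaction: it repeatedly asks the lexical
   analyzer for the next token until end of input. */
int main(void) {
    Token t = getNextToken();
    while (t.name != TOK_EOF)
        t = getNextToken();
    puts("done");
    return 0;
}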
Certain Other Tasks of the Lexical Analyzer
• One such task is stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the input).
• Another task is correlating error messages generated by the compiler with the
source program.
• For instance, the lexical analyzer may keep track of the number of newline
characters seen, so it can associate a line number with each error message.
Lexical Analyzers as a Cascade of Two Processes

• a) Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
• b) Lexical analysis proper is the more complex portion, which produces tokens
from the output of the scanner.
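A rough C sketch of the scanning stage, under the assumption that the source language uses // line comments; it also counts newlines, which is how a lexer can attach line numbers to error messages as described above. The names and details are illustrative, not a fixed implementation.

#include <stdio.h>
#include <ctype.h>

static int line_number = 1;   /* lets later phases report source lines */

/* Scanning stage: delete // comments, compact runs of whitespace
   into a single blank, and count newlines as they are consumed. */
void scan(FILE *in, FILE *out) {
    int c;
    while ((c = getc(in)) != EOF) {
        if (c == '/') {
            int d = getc(in);
            if (d == '/') {                    /* comment: skip to line end */
                while ((c = getc(in)) != EOF && c != '\n')
                    ;
                if (c == '\n') line_number++;
                putc(' ', out);
                continue;
            }
            if (d != EOF) ungetc(d, in);       /* not a comment after all */
        }
        if (isspace(c)) {                      /* compact whitespace */
            if (c == '\n') line_number++;
            while ((c = getc(in)) != EOF && isspace(c))
                if (c == '\n') line_number++;
            if (c != EOF) ungetc(c, in);
            putc(' ', out);
            continue;
        }
        putc(c, out);
    }
}

int main(void) { scan(stdin, stdout); return 0; }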
Lexical Analysis Versus Parsing
• There are a number of reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of
lexical and syntactic analysis often allows us to simplify at least one of these
tasks. For example, a parser that had to deal with comments and whitespace as
syntactic units would be considerably more complex than one that can assume
comments and whitespace have already been removed by the lexical analyzer.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In
addition, specialized buffering techniques for reading input characters can speed
up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be
restricted to the lexical analyzer.
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier. The
token names are the input symbols that the parser processes. In what follows, we
shall generally write the name of a token in boldface. We will often refer to a
token by its token name.
• A pattern is a description of the form that the lexemes of a token may take. In the
case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
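Some classic examples of how the three notions relate:
• Token if: the pattern is the characters i, f; the only lexeme is if.
• Token id: the pattern is a letter followed by letters and digits; sample lexemes are pi, score, and D2.
• Token number: the pattern is any numeric constant; sample lexemes are 3.14159, 0, and 6.02e23.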
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the particular
lexeme that matched.
• For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found in
the source program.
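For example, the tokens for the Fortran statement E = M * C ** 2 can be written as:
<id, pointer to symbol-table entry for E> <assign_op> <id, pointer to symbol-table entry for M> <mult_op> <id, pointer to symbol-table entry for C> <exp_op> <number, integer value 2>
Here the id and number tokens carry attribute values, while the operator tokens need none.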
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other components, that
there is a source-code error.
• For instance, if the string fi is encountered for the first time in a C program in the
context fi ( a == f(x) ) ..., a lexical analyzer cannot tell whether fi is a misspelling
of the keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the
token id to the parser and let some other phase of the compiler — probably the
parser in this case — handle an error due to transposition of the letters.
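A minimal sketch of why the lexer must behave this way, assuming a hypothetical keyword table: the lookup simply fails for fi, so the only correct answer at this level is the token id.

#include <stdio.h>
#include <string.h>

/* Hypothetical keyword table; a real compiler lists every keyword. */
static const char *keywords[] = { "if", "else", "while", "return" };

/* Return 1 if the lexeme is a keyword, 0 if it should be an id. */
int isKeyword(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}

int main(void) {
    /* "fi" is not in the table, so the lexer returns token id and
       leaves the misspelling for the parser to diagnose. */
    printf("%s\n", isKeyword("fi") ? "keyword" : "id");   /* prints: id */
    return 0;
}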
Input Buffering
• Let us examine some ways that the simple but important task of reading the source program can
be sped up.
• This task is made difficult by the fact that we often have to look one or more characters beyond
the next lexeme before we can be sure we have the right lexeme.
• We shall introduce a two-buffer scheme that handles large lookaheads safely.
• We then consider an improvement involving "sentinels" that saves time checking for the ends of
buffers.
• A sentinel is a special value that represents eof.
Buffer Pairs
• Because of the amount of time taken to process characters and the large number of
characters that must be processed during the compilation of a large source program,
specialized buffering techniques have been developed to reduce the amount of
overhead required to process a single input character. An important scheme involves
two buffers that are alternately reloaded.

• Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters into a buffer, rather
than using one system call per character. If fewer than N characters remain in the
input file, then a special character, represented by eof, marks the end of the source
file and is different from any possible character of the source program.
Buffer Pairs
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent
we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to the character at its right end.
• Then, after the lexeme is recorded as an attribute value of a token returned to the
parser, lexemeBegin is set to the character immediately after the lexeme just
found.
• Advancing forward requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer.
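A C sketch of the two-buffer scheme with sentinels, assuming the source text never contains the byte '\0' (which can therefore serve as the eof sentinel). The names and reload policy follow the description above, but the code is an illustration, not a fixed implementation.

#include <stdio.h>

#define N 4096            /* buffer size: typically one disk block */
#define SENTINEL '\0'     /* assumes '\0' never occurs in real source text */

static char buf1[N + 1], buf2[N + 1];   /* one extra slot for the sentinel */
static char *lexemeBegin, *forward;
static FILE *src;

/* Read up to N characters with one system call, then plant the
   sentinel; a short read means true end of input lies in this buffer. */
static void reload(char *buf) {
    size_t n = fread(buf, 1, N, src);
    buf[n] = SENTINEL;
}

/* Advance forward one character. Thanks to the sentinel, the common
   case needs a single comparison instead of two end-of-buffer tests. */
static int nextChar(void) {
    int c = (unsigned char)*forward++;
    if (c == SENTINEL) {
        if (forward == buf1 + N + 1) {          /* end of buffer 1 */
            reload(buf2);
            forward = buf2;
        } else if (forward == buf2 + N + 1) {   /* end of buffer 2 */
            reload(buf1);
            forward = buf1;
        } else {
            return EOF;   /* sentinel inside a buffer: real end of input */
        }
        c = (unsigned char)*forward++;
        if (c == SENTINEL)
            return EOF;   /* the reload read 0 bytes: true end of input */
    }
    return c;
}

int main(void) {
    src = stdin;
    reload(buf1);
    forward = lexemeBegin = buf1;
    int c;
    while ((c = nextChar()) != EOF)
        putchar(c);
    return 0;
}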
Specification of Tokens
• Regular expressions are an important notation for specifying lexeme patterns.
While they cannot express all possible patterns, they are very effective in
specifying those types of patterns that we actually need for tokens.
• An alphabet is any finite set of symbols. Typical examples of symbols are letters,
digits, and punctuation. The set {0,1} is the binary alphabet. ASCII is an important
example of an alphabet; it is used in many software systems.
• A string over an alphabet is a finite sequence of symbols drawn from that
alphabet. In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
• The length of a string s, usually written |s|, is the number of occurrences of
symbols in s.
• The empty string, denoted e, is the string of length zero.
Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero or more symbols
from the end of s. For example, ban, banana, and e are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana, banana, and e are suffixes of
banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For
instance, banana, nan, and e are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are neither e nor equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not
necessarily consecutive positions of s. For example, baan is a subsequence of
banana.
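Since the subsequence definition is the easiest of these to misread, here is a tiny C check of it; isSubsequence is a name made up for this illustration.

#include <stdio.h>

/* s is a subsequence of t if s can be obtained by deleting zero or
   more, not necessarily consecutive, positions of t. */
int isSubsequence(const char *s, const char *t) {
    while (*s && *t) {
        if (*s == *t)
            s++;          /* matched one symbol of s, in order */
        t++;
    }
    return *s == '\0';    /* all of s was matched */
}

int main(void) {
    printf("%d\n", isSubsequence("baan", "banana"));   /* prints: 1 */
    printf("%d\n", isSubsequence("naab", "banana"));   /* prints: 0 */
    return 0;
}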
Regular Expressions
• BASIS: There are two rules that form the basis.
1. e is a regular expression, and L(e) is {e}, that is, the language whose sole
member is the empty string.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with a in its one position.
• Note that by convention, we use italics for symbols, and boldface for their
corresponding regular expression.
• INDUCTION:
• There are four parts to the induction whereby larger regular expressions are built
from smaller ones. Suppose r and s are regular expressions denoting languages
L(r) and L(s), respectively.
INDUCTION
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around
expressions without changing the language they denote.
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.
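Under these conventions, for example, (a)|((b)*(c)) may be written a|b*c; both denote the string a together with all strings of zero or more b's followed by a c. As a practical aside, a pattern such as letter (letter | digit)* for identifiers can be tried with the POSIX regex library; this is a sketch using the standard regcomp/regexec calls, with made-up sample strings.

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* letter (letter | digit)* as a POSIX extended regular expression */
    regex_t re;
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB);

    const char *samples[] = { "score", "D2", "2fast" };
    for (int i = 0; i < 3; i++)
        printf("%-6s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "id" : "no match");
    /* prints: score id, D2 id, 2fast no match */
    regfree(&re);
    return 0;
}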
Continued…
