
UNIT-1 (Contd…)

TOPICS IN THIS SLIDE

• Programming Language Basics
• Lexical Analysis: the role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens
• Syntax Analysis: introduction, context-free grammar, elimination of left recursion and left factoring
1.6 PROGRAMMING LANGUAGE BASICS
1.6.1 The Static/Dynamic Distinction
Compiler decision-making in language design:
• Key issue: what decisions about a program can be made by the compiler?
• Static policy: decisions are made at compile time; the compiler determines behavior before the program runs.
• Dynamic policy: decisions are made at run time; the program's behavior depends on its execution context.
1.6 PROGRAMMING LANGUAGE BASICS
Scope of Declarations
• Scope definition: the region of a program in which a declaration (e.g., of a variable x) is valid.
• Static (lexical) scope:
• The scope of a declaration can be determined by examining the code alone.
• Most languages (e.g., C and Java) use static scope.
• Dynamic scope:
• The scope of a variable is determined during program execution.
• The same use of a variable x might refer to different declarations depending on the runtime context.
1.6 PROGRAMMING LANGUAGE BASICS
Static vs. Dynamic Distinction in a Java Example
• Static variables in Java:
• Example: public static int x;
• Declares x as a class variable: only one copy exists regardless of how many objects are created.
• The compiler can determine the memory location of x at compile time.
• Non-static variables in Java:
• If static is omitted, each object of the class has its own copy of x.
• The compiler cannot pre-determine all the memory locations; these are allocated at run time.
1.6.2 Environments and States
• Another important distinction we must make when discussing programming languages is whether changes occurring as the program runs affect the values of data elements or affect the interpretation of names for that data.
• For example, the execution of an assignment such as x = y + 1 changes the value denoted by the name x. More specifically, the assignment changes the value in whatever location is denoted by x.
• The association of names with locations in memory (the store) and then with values can be described by two mappings that change as the program runs (see Fig. 1.8):
• The environment is a mapping from names to locations in the store. Since variables refer to locations ("l-values" in the terminology of C), we could alternatively define an environment as a mapping from names to variables.
• The state is a mapping from locations in the store to their values. That is, the state maps l-values to their corresponding r-values, in the terminology of C. Environments change according to the scope rules of a language.
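In C terms, the environment tells us which location the name x denotes, and the state tells us what value that location currently holds. A minimal sketch (names and values are illustrative):

#include <stdio.h>

int main(void) {
    int x = 5;       /* environment: the name x denotes some location */
    int *loc = &x;   /* loc captures that location (the l-value of x) */

    printf("%d\n", *loc); /* state: the location currently holds 5 */
    x = x + 1;            /* the assignment changes the state only: */
    printf("%d\n", *loc); /* the same location now holds 6 */
    return 0;
}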
1.6.2 Environments and States
[Fig. 1.8: two-stage mapping from names to values; the environment maps names to locations in the store, and the state maps locations to their values.]
1.6.3 Static Scope and Block structure

[Slide figure: a program with global declarations of a and b and nested blocks B1-B4; the annotations read: 'a' in B3 and 'b' in B2 are accessed; global 'a' and 'b' in B4 are accessed; global 'a' and 'b' in B2 are accessed; global 'a' and 'b' are accessed. A reconstruction follows below.]
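The figure itself is not reproduced, but its annotations fix the structure. A minimal C sketch consistent with them (block labels B1-B4 and the initial values are assumptions):

#include <stdio.h>

int a = 1, b = 1;                    /* global 'a' and 'b' */

int main(void) {                     /* B1 */
    {                                /* B2: declares its own b */
        int b = 2;
        {                            /* B3: declares its own a */
            int a = 3;
            printf("%d %d\n", a, b); /* 'a' in B3, 'b' in B2: prints 3 2 */
        }
        {                            /* B4: declares its own b */
            int b = 4;
            printf("%d %d\n", a, b); /* global 'a', 'b' in B4: prints 1 4 */
        }
        printf("%d %d\n", a, b);     /* global 'a', 'b' in B2: prints 1 2 */
    }
    printf("%d %d\n", a, b);         /* global 'a' and 'b': prints 1 1 */
    return 0;
}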
1.6.4 EXPLICIT ACCESS CONTROL
Scope Introduced by Classes and Structures
• Class/structure members:
 Classes and structures introduce a new scope for their members.
 If p is an object of a class with a field x, the expression p.x refers to the field x defined in the class.
• Inheritance and scope:
 The scope of a member x in class C extends to any subclass C'.
 However, if C' has its own local declaration of x, it overrides the member from class C.
Access Control in Object-Oriented Languages (C++/Java)
• Encapsulation via access modifiers:
 Object-oriented languages like C++ and Java use keywords to control access to class members:
 public: accessible from outside the class.
 protected: accessible to subclasses.
 private: accessible only within the class and, in C++, by friend classes.
 These access modifiers support encapsulation by restricting access to class members.
1.6.5 Dynamic Scope
• Technically, any scoping policy is dynamic if it is based on factors that can be known only when the program executes. The term dynamic scope, however, usually refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration.
• In the slide's example, the name a resolves to whichever declaration of x is active at the point of use: in function b(), which has a local x = 1, the use of a prints 1; in function c(), a resolves to the global x = 2 and prints 2. (A reconstruction of the example follows below.)
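The slide's code is not reproduced; the following C sketch shows the same behavior using a preprocessor macro, whose expansion is resolved at each point of use (treating macro expansion as a stand-in for dynamic scope; the names and values follow the slide's annotations):

#include <stdio.h>

#define a x            /* the use of 'a' is resolved where it is expanded */

int x = 2;             /* global declaration of x */

void b(void) {
    int x = 1;         /* local declaration of x */
    printf("%d\n", a); /* 'a' finds the local x: prints 1 */
}

void c(void) {
    printf("%d\n", a); /* 'a' finds the global x: prints 2 */
}

int main(void) {
    b();               /* prints 1 */
    c();               /* prints 2 */
    return 0;
}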
1.6.6 Parameter Passing Mechanisms
• Actual parameters  used in the call of a procedure
• Formal parameters  used in the procedure definition
• Example:

main() { func(a, b); }       /* a, b here are the actual parameters */
void func(int a, int b) { }  /* a, b here are the formal parameters */

The two common mechanisms are:
1) Call by value
2) Call by reference
Call by value
• A copy of the actual parameter's value is passed to the function.
• Changes made inside the function do not affect the original variable.
• Example:

def modify(x):
    x = x + 5
    print("Inside function:", x)   # prints 15

a = 10
modify(a)
print("Outside function:", a)      # prints 10

• The function modified only the copy, not the original a.
Call by reference
• The function gets a reference (address) to the original variable, not a copy.
• Changes made inside the function do affect the original variable.

void reference(int *x) {
    *x = 5;            /* writes through the address of a */
}

int main() {
    int a = 10;
    reference(&a);     /* a is 5 after this call */
}

• The value of the original variable a is modified inside the function, not a copy.
1.6.7 ALIASING
• Aliasing happens when two or more variables refer to
the same memory location. This means that changing
one variable affects the other because they both point
to the same data.
#include <stdio.h>

int main() {
    int a = 10;
    int *ptr = &a;                        // ptr is an alias for a
    printf("Before change: a = %d\n", a); // prints 10
    *ptr = 20;                            // changing value using the alias (ptr)
    printf("After change: a = %d\n", a);  // prints 20
    return 0;
}

• ptr is an alias for a because it stores the address of a.
• When we update *ptr, it also updates a because both refer to the same memory location.
2. LEXICAL ANALYSIS
2.1 ROLE OF LEXICAL ANALYSER:
• As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the
source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the
source program. The stream of tokens is sent to the parser for syntax analysis
• The parser asks for tokens by calling getNextToken(); a sketch of this interface follows below.
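A minimal sketch of this interaction in C, with a stub lexer standing in for real scanning (the token kinds, the Token record, and the sample stream are illustrative, not from the slides):

#include <stdio.h>

/* hypothetical token kinds and token record */
enum TokenKind { TOK_ID, TOK_PLUS, TOK_NUMBER, TOK_EOF };

struct Token {
    enum TokenKind kind;    /* token name, consumed by the parser */
    const char    *lexeme;  /* attribute: the matched lexeme      */
};

/* stub lexer: replays a canned stream for "score + 1" */
struct Token getNextToken(void) {
    static struct Token stream[] = {
        { TOK_ID, "score" }, { TOK_PLUS, "+" },
        { TOK_NUMBER, "1" }, { TOK_EOF, "" }
    };
    static int next = 0;
    return stream[next++];
}

int main(void) {
    /* the parser pulls tokens one at a time via getNextToken() */
    for (struct Token t = getNextToken(); t.kind != TOK_EOF;
         t = getNextToken())
        printf("<token %d, \"%s\">\n", t.kind, t.lexeme);
    return 0;
}

Each call hands the parser exactly one token; the attribute (here the lexeme text) travels alongside the token name.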
2.1 ROLE OF LEXICAL ANALYSER
Lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion
of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of
tokens as output.
Lexical Analysis vs. Parsing: reasons for separating lexical analysis and parsing:
• Simplicity of design  The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
• Compiler efficiency is improved  A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
• Compiler portability is enhanced  Input-device-specific peculiarities can be restricted to the lexical analyzer.
2.1.1 TOKENS, PATTERNS, LEXEMES
• A token is like a label, a lexeme is an instance of a token, and a pattern is a rule which the lexemes of a token should match.
• Example:
• One token for each keyword. The pattern for a keyword is the same as the keyword itself.
• comparison is a token representing all the comparison operators like <=, >=, etc. The specific operators are the lexemes belonging to the comparison token.
• id (identifier) is a token representing all the identifiers, which start with a letter followed by letters and digits. "pi", "score", and "D2" are specific lexemes belonging to the identifier (id) token.
• number is a token matching all the numeric values like 3.14, etc. Each numeric value is a lexeme belonging to the number token.
• literal is a token matching all the lexemes starting with " and ending with ".
2.1.1 TOKENS, PATTERNS, LEXEMES
• A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier. The
token names are the input symbols that the parser processes.
• A pattern is a description of the form that the lexemes of a token may take. In the
case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
• TOKEN represents a category, like operators, keywords, etc.
• PATTERN is a regular expression which describes the lexemes
• LEXEME is the actual sequence of characters from the source code that matches the pattern
2.1.1 TOKENS, PATTERNS, LEXEMES
• printf("Total = %d\n", score);
• both printf and score are lexemes matching the pattern for token id, and
• "Total = %d\n" is a lexeme matching the literal token

• When multiple lexemes match a pattern, the lexical analyzer provides extra information to help
the compiler identify the specific lexeme.
• For example, both 0 and 1 match the number token pattern, but the exact lexeme found is
important for code generation.
• The lexical analyzer returns both the token name (for parsing) and an attribute value (to
describe the lexeme for later translation).
2.1.2 TOKENS, PATTERNS, LEXEMES
• Example 3.2: The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
• In certain pairs, especially operators, punctuation, and keywords, there is no need for an attribute value. In this example, the token number has been given an integer-valued attribute. In practice, a typical compiler would instead store a character string representing the constant and use as an attribute value for number a pointer to that string.
2.1.2 LEXICAL ERRORS
• A lexical analyzer alone can't easily detect source-code errors without help
from other components.
• For example, encountering fi in a C program could be a misspelled if or an
undeclared identifier.

• Since fi is a valid identifier, the lexical analyzer returns it as id, leaving error
detection to the parser or later compiler phases.
• If no token pattern matches the remaining input, the lexical analyzer can't
proceed.
• In panic mode recovery, characters are deleted until a valid token is found.
This may confuse the parser but is often sufficient in interactive environments.
• Other functions performed by lexical analysis are keeping track of line
numbers, stripping out white spaces like redundant blanks and tabs, and
deleting comments.
2.2 LEXICAL ANALYSIS: INPUT BUFFERING
• Lexical Analyzer scans the characters of the source program to
discover tokens.
• But, many tokens have to be examined before the next token
itself can be determined.
• So lexical analyzer reads the input from an input buffer.

[Figure: an input buffer holding the characters p r i n t f, with a token-beginning pointer at the start of the lexeme and a lookahead pointer scanning ahead.]

• One pointer marks the beginning of the lexeme being discovered.
• A lookahead pointer scans ahead of the beginning pointer until the lexeme matches a token; it points to the character to be read next.
2.2.1 BUFFER PAIRS
• If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source program.
• In the figure, each buffer half is of a limited size N (e.g., 4096 bytes).
• If the lookahead pointer runs so far ahead of the beginning of the lexeme that both halves are exhausted before the lexeme ends, the lexeme cannot be identified; for example, a very long token may not fit in the buffer at all.
• So preliminary scanning is useful: it removes unnecessary spaces, comments, etc. from the source program.
• A sentinel is a special character, such as eof, placed at the end of each buffer half; it lets the lexer detect the end of a half (or of the input) with a single test per character read.
[Figure: the buffer pair; the first and second buffer halves, each terminated by a sentinel (eof).]
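A sketch of the sentinel-based advance over a buffer pair, assuming reload helpers that fill one half at a time (all names and the layout are illustrative):

#include <stdio.h>

#define N 4096              /* size of each buffer half                 */
#define SENTINEL '\0'       /* stands in for the eof sentinel character */

static char buf[2 * N + 2];     /* buf[N] and buf[2N+1] hold sentinels  */
static char *lexemeBegin = buf; /* marks the start of the current lexeme */
static char *forward = buf;     /* lookahead pointer                     */

/* assumed helpers: each reads up to N source characters into one half
 * and writes the sentinel after the last character read */
static void reload_second_half(void) { /* fill buf[N+1 .. 2N] */ }
static void reload_first_half(void)  { /* fill buf[0 .. N-1]  */ }

/* return the next character, reloading a half when the
 * lookahead pointer crosses a sentinel */
static char advance(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return c;
    if (forward == buf + N + 1) {            /* end of first half  */
        reload_second_half();                /* forward is already at the second half */
    } else if (forward == buf + 2 * N + 2) { /* end of second half */
        reload_first_half();
        forward = buf;
    } else {
        return SENTINEL;  /* sentinel inside a half: real end of input */
    }
    return *forward++;
}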
2.3 DESIGN OF A LEXICAL ANALYZER

Regular Expression   Description
a*          Strings of zero or more a's; language L = {ε, a, aa, aaa, …}
a+          Strings of one or more a's; L = {a, aa, aaa, …}
a+b         a or b; L = {a, b}
a.b         ab; L = {ab}
(a+b)*c     Strings of zero or more a's and b's ending with c; L = {c, ac, bc, abc, abbc, …}
RECOGNITION OF TOKENS: REGULAR EXPRESSIONS
• Patterns are expressed using regular expressions.
• Regular expressions (regex) are patterns used to match character combinations in strings.
• They are widely used in search operations, data validation, and text processing in programming languages like Python, Java, and JavaScript.
EXAMPLES OF REGULAR EXPRESSIONS

Regular Expression   Meaning
a*          Strings of 0 or more a's. Language L = {ε, a, aa, aaa, …}
a+          Strings of 1 or more a's. L = {a, aa, aaa, …}
a+b         Strings of a or b; L = {a, b}
a.b         String ab; L = {ab}
(a+b)*      Strings of a's or b's repeated 0 or more times; L = {ε, a, b, aa, bb, abb, ababbb, …}
(a+b)+      Strings of a's or b's repeated 1 or more times; L = {a, b, aa, bb, abb, ababbb, …}
ab(a+b)*    Strings of a's and b's starting with ab; L = {ab, aba, abb, abab, abbb, …}
a*b*c*      Strings of 0 or more a's followed by 0 or more b's followed by 0 or more c's; L = {ε, a, b, c, aa, bb, cc, ab, abc, …}
(aa)*(bb)*  Strings of an even number of a's followed by an even number of b's; L = {ε, aa, bb, aabb, aaaabbbb, …}
(0+1)*000   Strings of 0's and 1's ending with 000; L = {000, 01000, …}
2.4 RECOGNITION OF TOKENS
• Patterns are expressed using regular expressions.
• Patterns for tokens (as defined in the transition-diagram exercises that follow):
digit  [0-9]
digits  digit+
number  digit . (digit)*
letter  [A-Z a-z]
id  letter (letter | digit)*
relop  < | > | <= | >= | = | == | <> | !=
• digit, digits, number, etc. are the tokens.
TRANSITION DIAGRAMS
• A transition diagram is a graphical representation of a finite state machine (FSM), or automaton, that models how a system transitions from one state to another based on input symbols.
• It is widely used in compiler design, automata theory, and regular expression processing.
• The components of a transition diagram are states, transitions, a start state, and final (accepting) states.
• Example 1: the transition diagram for the regular expression a.b* has start state 1, a transition labelled a to final state 2, and a transition labelled b from state 2 back to itself.
REGULAR EXPRESSIONS   TRANSITION DIAGRAMS
a*     State 1 is both the start and the final state, with an a-labelled loop on it.
a+     Start state 1 --a--> final state 2, with an a-labelled loop on state 2.
a+b    Start state 1 --a--> final state 2, and start state 1 --b--> final state 3.
Construct transition diagrams for the following regular expressions:
1) digit  [0-9]
Ans: start state 1 --[0-9]--> final state 2
2) digits  digit+
Ans: start state 1 --[0-9]--> final state 2, with a [0-9]-labelled loop on state 2
3) number  digit.(digit)*
Ans: start state 1 --[0-9]--> state 2 --.--> final state 3, with a [0-9]-labelled loop on state 3
Construct transition diagrams for the following regular expressions:
4) letter  [A-Z a-z]
Ans: start state 1 --[A-Z a-z]--> final state 2
5) id  letter (letter | digit)*
Ans: start state 1 --[A-Z a-z]--> final state 2, with loops on state 2 labelled [A-Z a-z] and [0-9]

6) if
Ans: start state 1 --i--> state 2 --f--> final state 3
7) then
Ans: start state 1 --t--> 2 --h--> 3 --e--> 4 --n--> final state 5
8) else
Ans: start state 1 --e--> 2 --l--> 3 --s--> 4 --e--> final state 5
Construct the transition diagram for the following:
relop  < | > | <= | >= | = | == | <> | !=
Ans: from the start state, branch on <, >, =, and !; each branch either accepts a two-character operator (<=, <>, >=, ==, !=) or accepts the one-character operator (<, >, =) with retraction of the lookahead. A direct coding of this diagram is sketched below.
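A minimal C sketch of the relop diagram, one switch case per branch (token names are illustrative; since the slide lists both Pascal-style = and <> and C-style == and !=, = and == are both returned as EQ here):

#include <stdio.h>

enum Relop { LT, LE, GT, GE, EQ, NE, NOT_RELOP };

/* simulate the relop diagram on the text at s;
 * *len receives the number of characters consumed */
enum Relop relop(const char *s, int *len) {
    *len = 2;                            /* two-character branches */
    switch (s[0]) {
    case '<':
        if (s[1] == '=') return LE;      /* <= */
        if (s[1] == '>') return NE;      /* <> */
        *len = 1; return LT;             /* <, retract one character */
    case '>':
        if (s[1] == '=') return GE;      /* >= */
        *len = 1; return GT;             /* > */
    case '=':
        if (s[1] == '=') return EQ;      /* == */
        *len = 1; return EQ;             /* = */
    case '!':
        if (s[1] == '=') return NE;      /* != */
        /* '!' alone is not a relop: fall through */
    default:
        *len = 0; return NOT_RELOP;
    }
}

int main(void) {
    int n;
    enum Relop r = relop("<= 5", &n);
    printf("relop %d, %d chars\n", r, n);  /* relop 1, 2 chars */
    return 0;
}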
2.5 TRANSITION DIAGRAMS
1) Design a transition diagram for whitespace (ws), given by ws  delim+ where delim is a blank, tab, or newline.
Ans: start state 1 --delim--> state 2, with a delim-labelled loop on state 2 and an accepting state entered on the first non-delimiter, with retraction.
2.5 TRANSITION DIAGRAMS
3) Design a transition diagram for the keywords while, for, do, exit, switch.
Ans: from a common start state, one chain of states per keyword spells it out character by character, e.g., 1 --w--> 2 --h--> 3 --i--> 4 --l--> 5 --e--> final state for while, with analogous chains for for, do, exit, and switch.
THE ROLE OF A PARSER
• The parser receives a string of tokens from the lexical analyzer and verifies that it follows the grammar of the source language.
• It reports syntax errors clearly and attempts to recover so that it can continue processing the program.
• For well-formed programs, the parser constructs a parse tree (explicitly or implicitly) and passes it to the rest of the compiler for further processing.
• The parser and the rest of the front end may be implemented as a single module.
ROLE OF A PARSER
• There are three types of parsers: universal, top-down, and bottom-up.
• Universal parsers can handle any grammar but are too inefficient for production compilers.
• Top-down parsers build parse trees from the root to the leaves, while bottom-up parsers build from the leaves to the root.
• Both scan the input left to right.
SYNTAX ERROR HANDLING
Common programming errors can occur at many different levels.
• Lexical errors include misspellings of identifiers, keywords, or operators, e.g., the use of an identifier elipsesize instead of ellipsesize, and missing quotes around text intended as a string.
• Syntactic errors include misplaced semicolons or extra or missing braces, that is, "{" or "}". As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).
• Semantic errors include type mismatches between operators and operands. An example is a return statement in a Java method with result type void.
• Logical errors can be anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==. The program containing = may be well formed; however, it may not reflect the programmer's intent.
SYNTAX ERROR HANDLING
The error handler in a parser has goals that
are simple to state but challenging to
realize:
• Report the presence of errors clearly and
accurately.
• Recover from each error quickly enough to
detect subsequent errors.
• Add minimal overhead to the processing of
correct programs.
ERROR RECOVERY STRATEGIES
• Once an error is detected, how should the parser
recover? Although no strategy has proven itself
universally acceptable, a few methods have broad
applicability.
• The error recovery strategies are:
1) Panic-mode recovery
2) Phrase-level recovery
3) Error productions
4) Global correction
ERROR RECOVERY STRATEGIES
1)Panic Mode Recovery
• With this method, on discovering an error, the parser discards input symbols
one at a time until one of a designated set of synchronizing tokens is found.
• The synchronizing tokens are usually delimiters, such as semicolon or },
whose role in the source program is clear and unambiguous.
• The compiler designer must select the synchronizing tokens appropriate for
the source language.
• While panic-mode correction often skips a considerable amount of input
without checking it for additional errors, it has the advantage of simplicity,
and, unlike some methods to be considered later, is guaranteed not to go into
an infinite loop
Panic mode recovery example
Consider a simple arithmetic expression grammar:
E→E+T|T
T→T*F|F
F → (E) | id
If the input string is: id + * id
Here, + * is an error because an operand (e.g., id) is expected
after +, but * appears instead.
• Panic Mode Recovery Approach:
• When the parser detects an error at + *, it enters panic mode.
• It skips tokens until it finds a suitable synchronization point (e.g., id).
• Parsing resumes from id.
• This allows the parser to continue processing without stopping due to
the syntax error.
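A minimal C sketch of the skipping loop for this example, with a stub lexer replaying the tokens that remain after the bad '*' (token names and the stream are illustrative):

#include <stdio.h>

enum TokenKind { TOK_ID, TOK_PLUS, TOK_STAR, TOK_SEMI, TOK_EOF };

/* stub lexer: replays the input remaining after the bad '*'
 * in "id + * id ;" (hypothetical stream, for illustration) */
enum TokenKind getNextToken(void) {
    static enum TokenKind rest[] = { TOK_ID, TOK_SEMI, TOK_EOF };
    static int i = 0;
    return rest[i++];
}

int main(void) {
    enum TokenKind look = TOK_STAR;  /* the token where the error was found */
    int skipped = 0;
    /* panic mode: discard input symbols until a synchronizing token */
    while (look != TOK_SEMI && look != TOK_EOF) {
        look = getNextToken();
        skipped++;
    }
    printf("skipped %d tokens, resuming at the synchronizing token\n", skipped);
    return 0;
}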
ERROR RECOVERY STRATEGIES
2) Phrase-Level Recovery
• On discovering an error, a parser may perform local correction on the remaining
input; that is, it may replace a prefix of the remaining input by some string that
allows the parser to continue.
• A typical local correction is to replace a comma by a semicolon, delete an
extraneous semicolon, or insert a missing semicolon. The choice of the local
correction is left to the compiler designer.
• Replacements must not lead to infinite loops.
• Phrase-level replacement has been used in several error-repairing compilers, as
it can correct any input string.
Phrase level recovery: Example
Consider a simple arithmetic expression grammar:
E→E+T|T
T→T*F|F
F → (E) | id
If the input string is: id + * id
Here, + * is an error because an operand (e.g., id) is expected after
+, but * appears instead.
Phrase-Level Recovery Approach:
• The parser detects the error at + *.
• It replaces * with a valid token, such as id, to maintain
grammatical correctness.
• By making a small correction at the phrase level, the parser can
continue parsing without skipping large portions of the input.
ERROR RECOVERY STRATEGIES
3) Error Productions
When we write code or use a programming language, it's common to make
mistakes, like missing a semicolon or using the wrong syntax. Error
productions are special rules added to the language's grammar to handle
these common mistakes.
• Why Add Error Productions?
By adding these rules, we help the parser (the part of the compiler that
reads and understands code) recognize when something is wrong. Instead of
crashing or giving a confusing error message, the parser can identify the
specific mistake and give a helpful message about what went wrong.
• How Does It Work?
• The grammar of a language is like a set of instructions that tell the parser
how correct code should look.
• Error productions are extra instructions that tell the parser, "If you see
something that looks like this common mistake, recognize it as an error."
• When the parser encounters this, it can stop and say something like, "Hey,
you forgot a semicolon here!" instead of just saying, "Syntax error."
ERROR PRODUCTIONS: EXAMPLE
Consider a simple arithmetic expression grammar (without error productions):
E→E+T|T
T→T*F|F
F → (E) | id
If the input string is: id + * id (+ * is an error).
Modified Grammar with Error Production:
E→E+T|T
T→T*F|F
F → (E) | id
F → error // Error production to catch invalid tokens
• If the parser encounters + * instead of + id, it recognizes * as an error
using the F → error rule.
• The compiler can now generate a meaningful error message like:
"Syntax Error: Expected an operand after '+', but found '*' instead."
ERROR RECOVERY STRATEGIES
4) Global correction is about making the smallest possible changes to fix incorrect code. Instead of making small local fixes (as in phrase-level recovery), a global correction algorithm analyzes the full input and makes the least number of changes to create a valid sentence.
Working:
• Compare the incorrect code to the grammar rules of the language.
• Find a correct version of the code that is as close as possible to what was written, with the least number of changes.
• This could involve:
 Inserting missing symbols (e.g., adding a semicolon).
 Deleting extra or incorrect parts.
 Replacing wrong tokens with the right ones.
[Slide figure: the compiler corrects the erroneous input to the nearest valid program.]
CONTEXT FREE GRAMMARS
• Grammars describe the syntax of programming language constructs like
expressions and statements.
• Formal definition: G=(V, T, P, S) where V is a set of variables, T is a set of
terminals, P is a set of productions, and S is the start symbol.
• Example CFG for simple arithmetic expressions, G = (V, T, P, S), where:
V = {expression, term, factor}
T = {+, -, *, /, (, ), id}
P:
expression  expression + term | expression - term | term
term  term * factor | term / factor | factor
factor  ( expression ) | id
S = {expression}
DERIVATIONS, PARSE TREES
• Leftmost derivation (LMD): a step-by-step process in which the leftmost non-terminal in a string is replaced first at every step, according to the grammar rules of a language.
• Rightmost derivation (RMD): a step-by-step process in which the rightmost non-terminal in a string is replaced first at every step, according to the grammar rules of a language.
• A parse tree is a graphical representation of a derivation that filters out the order in which productions are applied to replace non-terminals.
CFG: DERIVATIONS  LMD, RMD, PARSE TREE
• Example 2: Derive -(id+id) using LMD and RMD for the CFG
E  E + E | E * E | - E | ( E ) | id
and draw the corresponding parse trees.
Ans:
LMD: E  -E  -(E)  -(E+E)  -(id+E)  -(id+id)
RMD: E  -E  -(E)  -(E+E)  -(E+id)  -(id+id)
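Both derivations yield the same parse tree for -(id+id), sketched here in text form:

E
├── -
└── E
    ├── (
    ├── E
    │   ├── E
    │   │   └── id
    │   ├── +
    │   └── E
    │       └── id
    └── )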
PROBLEMS ON LMD, RMD, PARSE TREES
Construct LMD, RMD, and the corresponding parse trees for the following CFGs:
1) S  SS | aSb | ε for the string "abaabb"
2) S  aB | bA for the string "aaabbabbba"
   A  aS | bAA | a
   B  bS | aBB | b
3) E  I | E+E | E*E | (E) for the string "(a101+b1)*(a1+b)"
   I  a | b | Ia | Ib | I0 | I1
SOLUTION-1
Solution 1) S  SS | aSb | ε for the string "abaabb"

LMD:
S  SS
 aSbS
 aεbS
 abaSb
 abaaSbb
 abaaεbb
 abaabb

RMD:
S  SS
 SaSb
 SaaSbb
 Saaεbb
 aSbaabb
 aεbaabb
 abaabb
SOLUTION-2 for the string "aaabbabbba"
S  aB | bA
A  aS | bAA | a
B  bS | aBB | b

LMD:
S  aB
 aaBB
 aaaBBB
 aaabBB
 aaabbB
 aaabbaBB
 aaabbabB
 aaabbabbS
 aaabbabbbA
 aaabbabbba

RMD:
S  aB
 aaBB
 aaBbS
 aaBbbA
 aaBbba
 aaaBBbba
 aaaBbbba
 aaabSbbba
 aaabbAbbba
 aaabbabbba
