SPCC - 5
Compiler:
A compiler translates source code written in a high-level language (HLL) into target code in assembly or another low-level language. It reports any errors in the source program to the user.
Phases of compiler:
Lexical Analysis:
● Lexical Analysis serves as the foundational phase in compiler design, acting as the first
step in the compilation process. It is commonly referred to as "lexer," "tokenizer," or
simply "scanner."
● Functions:
○ Reading Source Code: The lexer scans the source code, reading it character by character as a stream.
○ Tokenization: It groups sequences of characters into tokens, the fundamental units of the language (see the tokenizer sketch at the end of this section).
○ Whitespace and Comment Handling: The lexer removes excess spaces and
comments (e.g., // or #) from the source code.
○ Error Detection: Detects lexical errors, such as misspelled tokens, identifier length violations, or illegal characters, while scanning the source program.
○ Data Passing: Reads character streams from the source code, validates legal
tokens, and forwards data to the syntax analyzer upon request.
○ Symbol Table Population: Assists in identifying tokens and populating the symbol
table, aiding in further stages of compilation.
● Lexical analyzer reads the input source program, scans the characters and produces a
sequence of tokens that the parser can use for syntactic analysis.
● “get next token” is a command sent from the parser to the lexical analyzer.
● Language -
○ A language is a set of strings over some finite alphabet. Computer languages are treated as such sets, and mathematically set operations can be performed on them.
○ Regular languages (including all finite languages) can be described by means of regular expressions.
● Longest Match Rule -
○ When the lexical analyzer reads the source-code, it scans the code letter by letter;
and when it encounters a whitespace, operator symbol, or special symbols, it
decides that a word is completed.
○ int intvalue;
○ While scanning the input up to ‘int’, the lexical analyzer cannot determine whether it is the keyword int or the first three characters of the identifier intvalue; it must continue reading until the lexeme ends and take the longest match.
● The complete set of tokens forms the set of terminal symbols used in the grammar for the parser. In most languages, the tokens fall into these categories -
○ Keywords
○ Operators
○ Identifiers
○ Constants
○ Literal strings
○ Punctuation
● Lexical analysis is the process of recognizing tokens from the input. Following are the
steps -
○ Store the input in the input buffer
○ Each token is read and a regular expression is built for the corresponding token.
○ Each regular expression is converted into a finite automaton (FA).
○ For each state of the FA, a function is designed; the input symbols and the transition edges correspond to the parameters of these functions.
○ The set of such functions ultimately constitutes the lexical analyzer program.
● Functions of lexical analyser -
○ Tokenization.
○ Report only lexeme-related error messages: exceeding identifier length, unmatched string, illegal characters. No messages relate to syntax or semantics.
○ Eliminate comments, white spaces (Tab, blank space, newline).
● The lexical analyzer uses transition diagrams to keep track of information about the characters seen as the forward pointer scans the input. Transition diagrams are a form of finite automata.
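The following is a minimal tokenizer sketch in Python; the token classes, keyword list, and the function name get_next_token are illustrative assumptions, not a fixed standard. It demonstrates tokenization, elimination of whitespace and comments, the longest match rule, and detection of illegal characters:

    import re

    # Each token class is defined by a regular expression (its pattern).
    # KEYWORD is tried before IDENTIFIER; the \b boundary makes "intvalue"
    # match as one IDENTIFIER rather than keyword "int" + "value",
    # illustrating the longest match rule.
    TOKEN_SPEC = [
        ("WHITESPACE",  r"[ \t\n]+"),            # eliminated, never reported
        ("COMMENT",     r"//[^\n]*|#[^\n]*"),    # eliminated, never reported
        ("KEYWORD",     r"\b(?:int|float|if|else|while)\b"),
        ("IDENTIFIER",  r"[A-Za-z_][A-Za-z_0-9]*"),
        ("CONSTANT",    r"\d+"),
        ("OPERATOR",    r"[+\-*/=<>]"),
        ("PUNCTUATION", r"[;,(){}]"),
    ]
    MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

    def get_next_token(source, pos=0):
        """Yields (token, lexeme) pairs on request -- the 'get next token'
        service the parser asks of the lexical analyzer."""
        while pos < len(source):
            m = MASTER.match(source, pos)
            if m is None:  # lexical error: illegal character
                raise SyntaxError(f"illegal character {source[pos]!r} at {pos}")
            pos = m.end()
            if m.lastgroup not in ("WHITESPACE", "COMMENT"):
                yield m.lastgroup, m.group()

    print(list(get_next_token("int intvalue = 42; // declaration")))
    # [('KEYWORD', 'int'), ('IDENTIFIER', 'intvalue'), ('OPERATOR', '='),
    #  ('CONSTANT', '42'), ('PUNCTUATION', ';')]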
Lexemes, Tokens and Patterns:
● A lexeme is a sequence of characters within the source code that adheres to the defined pattern of a token, essentially serving as an occurrence (instance) of that token.
● Tokens, on the other hand, are specific sequences of characters that encapsulate
meaningful units of information within the source program. These units can include
identifiers, keywords, constants (literals), operators, separators, and special characters.
● For instance, when considering a keyword, it operates as a token with a distinct pattern,
which is essentially a predefined sequence of characters representing that keyword. This
pattern is utilized to identify and categorize the keyword within the source code.
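● As a concrete illustration: in the statement int count = 10; the lexeme count is an instance of the token identifier (pattern: a letter followed by letters or digits), the lexeme 10 is an instance of the token constant (pattern: one or more digits), and the lexeme int matches the fixed character sequence that forms the pattern of the keyword token.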
Symbol Table: The symbol table will contain the following types of information for the
input strings in a source program -
● The lexeme (input string) itself
● Corresponding token
● Its semantic component (e.g., variable, operator, constant, functions, procedure,
etc.)
● Data type
● Pointers to other entries (when necessary)
● The Symbol Table undergoes various operations, with key focus on the following pivotal
tasks:
○ Addition of Symbols - During initialization, the Symbol Table is populated with
reserved words, standard identifiers, and operators. As the scanner processes new
lexemes, they are dynamically added to the table and associated with a token
class. Furthermore, the semantic analyzer enriches these lexemes with pertinent
properties and attributes.
○ Organization - The Symbol Table can be structured in diverse ways, each with its
own advantages and drawbacks. A conventional approach involves organizing it as
an array of records. However, this method necessitates either a linear search for
retrieval or continual sorting to maintain orderliness.
● By optimizing the organization of the Symbol Table, compilers can streamline the lookup
process, thereby enhancing overall efficiency and performance.
● Some other ways of organizing the symbol table are -
○ Unordered List
○ Binary Search Tree
○ String table and name table
○ Hash table and name table
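A minimal sketch of a hash-table-organized symbol table in Python (a dict serves as the hash table; the field names and the helper functions add_symbol and lookup are illustrative assumptions):

    # Symbol table organized as a hash table, holding the kinds of
    # information listed above.
    symbol_table = {}

    def add_symbol(lexeme, token, semantic_role, data_type=None):
        """Insert a new entry; called by the scanner when a new lexeme is
        seen. The semantic analyzer later enriches it with attributes."""
        if lexeme not in symbol_table:
            symbol_table[lexeme] = {
                "token": token,
                "role": semantic_role,   # variable, constant, function, ...
                "type": data_type,
            }
        return symbol_table[lexeme]

    def lookup(lexeme):
        """Hash-table lookup is O(1) on average -- the optimization the
        notes mention, versus linear search over an array of records."""
        return symbol_table.get(lexeme)

    # During initialization the table is populated with reserved words:
    for kw in ("int", "float", "if", "while"):
        add_symbol(kw, "KEYWORD", "reserved word")

    add_symbol("count", "IDENTIFIER", "variable", "int")
    print(lookup("count"))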
Syntax analysis:
● The Syntax Analyzer adheres to the production rules outlined by Context-Free Grammar
(CFG). Context-Free Grammar is formally represented as G(V, T, P, S), where:
○ V represents a set of non-terminal symbols.
○ T represents a set of terminal symbols, where the intersection of V and T is empty.
○ P denotes a set of rules, with each rule structured as P: V → (V ∪ T)*. In simpler terms, the left-hand side of each production rule in P is a single non-terminal with no contextual dependency on either side.
○ S stands for the start symbol, signifying the initial non-terminal symbol from which
the derivation of valid strings commences.
● CFG serves as a foundational framework for defining the syntax and structure of
programming languages, providing a formalized structure for syntactic analysis by
compilers and parsers.
● A derivation tree or parse tree is an ordered rooted tree that graphically represents how a string is derived from a context-free grammar, i.e., its syntactic structure.
● Leftmost and Rightmost Derivation of a String-
○ Leftmost derivation − A leftmost derivation is obtained by applying production to
the leftmost variable in each step.
○ Rightmost derivation − A rightmost derivation is obtained by applying production
to the rightmost variable in each step.
● A grammar is said to be ambiguous if, for some string generated by it, it produces more than one
○ Parse tree
○ Derivation tree
○ Syntax tree
○ Leftmost derivation
○ Rightmost derivation
● An ambiguous grammar creates confusion for the parser (see the worked example after this list).
● Left and Right Recursive Grammars-
○ In a context-free grammar G, if there is a production in the form X → Xa where X
is a non-terminal and ‘a’ is a string of terminals, it is called a left recursive
production. The grammar having a left recursive production is called a left
recursive grammar.
○ And if in a context-free grammar G, if there is a production in the form X → aX
where X is a non-terminal and ‘a’ is a string of terminals, it is called a right
recursive production. The grammar having a right recursive production is called a
right recursive grammar.
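As a worked example of ambiguity, consider the grammar E → E + E | E x E | id. For the string id + id x id there are two distinct leftmost derivations, and hence two parse trees:
○ E ⇒ E + E ⇒ id + E ⇒ id + E x E ⇒ id + id x E ⇒ id + id x id
○ E ⇒ E x E ⇒ E + E x E ⇒ id + E x E ⇒ id + id x E ⇒ id + id x id
Since the same string has more than one parse tree, the grammar is ambiguous, and the parser cannot tell which tree reflects the intended grouping.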
LL(1) Parser:
● It is a non-recursive predictive parser.
● It is a top-down parser.
● In this,
○ L - Scanning input from left to right
○ L - Producing a leftmost derivation
○ 1- One input symbol of lookahead at each step
● A grammar is LL(1) iff, whenever A → α | β are two distinct productions:
○ First(α) ∩ First(β) = ∅
○ At most one of α and β can derive the empty string ε
○ If β ⇒* ε, then First(α) ∩ Follow(A) = ∅ (and symmetrically for α)
● INPUT: Contains the string to be parsed, with $ as its end marker.
● STACK: Contains a sequence of grammar symbols, with $ as its bottom marker. Initially the stack contains only $.
● PARSING TABLE: A two-dimensional array M[A, a], where A is a non-terminal and a is a terminal.
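A minimal sketch of the non-recursive predictive parsing loop in Python; the grammar, the table contents, and the function name ll1_parse are illustrative assumptions (the table would normally be computed from FIRST and FOLLOW sets):

    # Standard expression grammar after removing left recursion:
    #   E  -> T E'        E' -> + T E' | eps
    #   T  -> F T'        T' -> * F T' | eps
    #   F  -> ( E ) | id
    TABLE = {
        ("E",  "id"): ["T", "E'"],  ("E",  "("): ["T", "E'"],
        ("E'", "+"):  ["+", "T", "E'"],
        ("E'", ")"):  [],           ("E'", "$"): [],           # E' -> eps
        ("T",  "id"): ["F", "T'"],  ("T",  "("): ["F", "T'"],
        ("T'", "*"):  ["*", "F", "T'"],
        ("T'", "+"):  [], ("T'", ")"): [], ("T'", "$"): [],    # T' -> eps
        ("F",  "id"): ["id"],       ("F",  "("): ["(", "E", ")"],
    }
    NONTERMINALS = {"E", "E'", "T", "T'", "F"}

    def ll1_parse(tokens):
        stack = ["$", "E"]        # stack holds $ and the start symbol
        tokens = tokens + ["$"]   # $ marks the end of the input
        i = 0
        while stack[-1] != "$":
            top, a = stack[-1], tokens[i]
            if top == a:          # terminal on top: match and advance
                stack.pop(); i += 1
            elif top in NONTERMINALS and (top, a) in TABLE:
                stack.pop()       # expand by M[A, a], pushing the RHS reversed
                stack.extend(reversed(TABLE[(top, a)]))
            else:
                raise SyntaxError(f"unexpected token {a!r}")
        return tokens[i] == "$"   # accept when both stack and input reach $

    print(ll1_parse(["id", "+", "id", "*", "id"]))   # True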
NOTEBOOK
Left Recursive Grammar:
● A production of grammar is said to have left recursion if the leftmost variable of its RHS
is the same as the variable of its LHS.
● A grammar containing a production having left recursion is called a Left Recursive Grammar.
● Left recursion can be eliminated by converting it to right recursion. Elimination of left recursion -
○ A → Aα / β (where β does not begin with an A.)
○ We can eliminate it as follows -
■ A → βA’
■ A’ → αA’ / ε
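○ For example, the left recursive grammar E → E + T / T (here α is + T and β is T) becomes:
■ E → T E’
■ E’ → + T E’ / ε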
NOTEBOOK
Right Recursive Grammar:
● A production of grammar is said to have right recursion if the rightmost variable of its
RHS is the same as the variable of its LHS.
● A grammar containing a production having right recursion is called Right Recursive
Grammar.
● Right recursion does not create any problem for top-down parsers; therefore, there is no need to eliminate right recursion from the grammar.
Grammar with common prefixes:
● If the RHS of more than one production starts with the same symbol, then such a grammar is called a Grammar With Common Prefixes.
● A → αβ1 / αβ2 / αβ3
● This kind of grammar creates a problematic situation for Top down parsers.
● Top-down parsers cannot decide which production must be chosen to parse the string in hand.
● To remove this confusion, we use left factoring.
● It converts a non-deterministic CFG into a deterministic CFG.
● We make one production for each common prefix.
● The common prefix may be a terminal, a non-terminal, or a combination of both.
● The rest of the derivation is added by new productions.
● The grammar obtained after the process of left factoring is called a Left Factored Grammar (see the worked example below).
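● For example, A → αβ1 / αβ2 / αβ3 is left factored as:
○ A → αA’
○ A’ → β1 / β2 / β3
● A concrete instance: S → iEtS / iEtSeS / a becomes S → iEtSS’ / a with S’ → eS / ε.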
NOTEBOOK
Bottom Up parsers:
LR parser -
● It is a non-recursive shift-reduce parser – LR(k).
○ L - Left-to-right scanning of the input stream
○ R - Construction of a rightmost derivation in reverse
○ k - Number of lookahead symbols used to make parsing decisions
● An LR-Parser uses -
○ States to memorize information during the parsing process.
○ An action table to make decisions (such as shift or reduce) and to compute states.
○ A goto table to compute states.
● S-R and R-R conflicts -
○ A shift-reduce (S-R) conflict occurs when the parser cannot decide whether to shift the next input symbol or to reduce by a production.
○ A reduce-reduce (R-R) conflict occurs when more than one production is applicable for a reduction in the same state.
● Advantages of LR parser -
○ LR parsers can handle a large class of context-free grammars.
○ The LR parsing method is the most general non-backtracking shift-reduce parsing method.
○ An LR parser can detect a syntax error as soon as it occurs in a left-to-right scan of the input.
○ LR grammars can describe more languages than LL grammars.
● Disadvantages of LR parser -
○ It is too much work to construct an LR parser by hand.
○ It needs an automated parser generator.
○ If the grammar contains ambiguities or other constructs that are difficult to parse in a left-to-right scan of the input, conflicts arise in the tables.
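As a worked example (using the illustrative grammar E → E + T / T, T → id), a shift-reduce parse of id + id constructs the rightmost derivation in reverse:

    Stack        Input        Action
    $            id + id $    shift
    $ id         + id $       reduce T → id
    $ T          + id $       reduce E → T
    $ E          + id $       shift
    $ E +        id $         shift
    $ E + id     $            reduce T → id
    $ E + T      $            reduce E → E + T
    $ E          $            accept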
NOTEBOOK
Operator Grammars:
● Operator grammars have the property that no production has an empty (null) right side and no right side has two adjacent nonterminals. This property enables the implementation of efficient operator-precedence parsers.
● Rule 1-
○ If precedence of b is higher than precedence of a, then we define a < b.
○ If precedence of b is the same as precedence of a, then we define a = b.
○ If precedence of b is lower than precedence of a, then we define a > b.
● Rule 2-
○ An identifier is always given a higher precedence than any other symbol. $ symbol
is always given the lowest precedence.
● Rule 3-
○ If two operators have the same precedence, then we go by checking their
associativity.
E → E+E | ExE | id
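For this grammar, applying the rules above (id has the highest precedence, x is higher than +, both operators are left-associative, and $ is lowest) gives the following operator-precedence table; each cell is the relation between the symbol on the stack (row) and the incoming symbol (column):

          +       x       id      $
    +     >       <       <       >
    x     >       >       <       >
    id    >       >       error   >
    $     <       <       <       accept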
NOTEBOOK
Semantic Analysis:
● Semantic analysis checks the semantic consistency of the code.
● It uses the syntax tree of the previous phase along with the symbol table to verify that
the given source code is semantically consistent.
● It also checks whether the code is conveying an appropriate meaning.
CFG + semantic rules = Syntax Directed Definitions
● The semantic analyzer is expected to recognize:
○ Type mismatch.
○ Undeclared variable.
○ Reserved identifier misuse.
○ Multiple declaration of variables in a scope.
○ Accessing an out of scope variable.
○ Actual and formal parameter mismatch.
● Functions of semantic analysis:
○ Helps you store the type information gathered and save it in the symbol table or syntax tree
○ Allows you to perform type checking
○ In the case of a type mismatch, where no exact type-conversion rule satisfies the desired operation, a semantic error is reported (see the sketch after this list)
○ Collects type information and checks for type compatibility
○ Checks whether the source language permits the operands or not
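A minimal sketch of such checks in Python over a tiny tuple-based AST; the node shapes, the type rules, and the function name check are illustrative assumptions, not a real compiler's API:

    def check(node, symbols):
        """Returns the type of an expression node, reporting semantic errors."""
        kind = node[0]
        if kind == "num":
            return "int"
        if kind == "str":
            return "string"
        if kind == "var":                    # undeclared-variable check
            name = node[1]
            if name not in symbols:
                raise TypeError(f"undeclared variable '{name}'")
            return symbols[name]
        if kind == "+":                      # type-compatibility check
            lt = check(node[1], symbols)
            rt = check(node[2], symbols)
            if lt != rt:                     # no conversion rule: type mismatch
                raise TypeError(f"type mismatch: {lt} + {rt}")
            return lt

    symbols = {"count": "int"}               # filled in from declarations
    print(check(("+", ("var", "count"), ("num", 1)), symbols))   # int
    # check(("+", ("var", "count"), ("str", "a")), symbols) -> type mismatch
    # check(("var", "total"), symbols)                      -> undeclared variable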
Syntax directed translation:
● In syntax directed translation, we associate some informal notations, called semantic rules, with the grammar.
● In syntax directed translation, every non-terminal can have zero, one, or more attributes, depending on the type of attribute. The values of these attributes are evaluated by the semantic rules associated with the production rules.
● In a semantic rule, an attribute (e.g., VAL) may hold anything: a string, a number, a memory location, or a complex record.
● In syntax directed translation, whenever a construct is encountered in the programming language, it is translated according to the semantic rules defined for it in that particular language.
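● For example, for a simple desk calculator the production E → E1 + T may carry the semantic rule E.VAL = E1.VAL + T.VAL, so that the VAL attribute of E is computed from the VAL attributes of its children when the production is applied.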
Syntax directed translation scheme:
● The syntax directed translation scheme is a context-free grammar.
● The syntax directed translation scheme is used to evaluate the order of semantic rules.
In the translation scheme, the semantic rules are embedded within the right side of the
productions.
● The position at which an action is to be executed is shown by enclosing the action between braces. It is written within the right side of the production.
● Annotated Parse Tree – The parse tree containing the values of the attributes at each node for a given input string is called an annotated or decorated parse tree.
● Features –
○ High level specification
○ Hides implementation details
○ Explicit order of evaluation is not specified
Implementation of SDT:
● Syntax directed translation is implemented by constructing a parse tree and performing the actions in a left-to-right, depth-first order.
● SDT is implemented by parsing the input and producing a parse tree as a result (a minimal sketch follows).
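A minimal sketch in Python of SDT for a desk calculator; the grammar, the token format, and the function names parse_E and parse_T are illustrative assumptions. The semantic action computing val runs at the point where it is embedded, in left-to-right depth-first order:

    # Grammar (left recursion removed):  E -> T (+ T)*,  T -> num
    def parse_T(tokens):
        # T -> num, with semantic rule T.val = num.lexval
        return int(tokens[0]), tokens[1:]

    def parse_E(tokens):
        val, tokens = parse_T(tokens)       # E.val initialized to T.val
        while tokens and tokens[0] == "+":
            rhs, tokens = parse_T(tokens[1:])
            val = val + rhs                 # embedded action: E.val = E.val + T.val
        return val, tokens

    print(parse_E(["2", "+", "3", "+", "4"])[0])   # prints 9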