Compiler - Design - Notes - Unit - 1 - Part 1
Unit 1
Introduction to Compiler: Phases and passes, Bootstrapping, Finite state machines and regular
expressions and their applications to lexical analysis, Optimization of DFA-Based Pattern
Matchers, implementation of lexical analyzers, lexical-analyzer generator, LEX compiler, Formal
grammars and their application to syntax analysis, BNF notation, ambiguity, YACC. The
syntactic specification of programming languages: Context free grammars, derivation and parse
trees, capabilities of CFG.
Introduction to Compiler
o A compiler is a translator that converts a high-level language into machine
language.
o A high-level language is written by a developer, while machine language can be understood
by the processor.
o A compiler also reports errors in the source program to the programmer.
o The main purpose of a compiler is to translate code written in one language into another
without changing the meaning of the program.
o When we execute a program written in a high-level language (HLL), the execution
proceeds in two parts.
o In the first part, the source program is compiled and translated into an object program (low-
level language).
o In the second part, the object program is translated into the target program through the
assembler.
Role of Compilers
A program written in a high-level language cannot run without compilation. Each
programming language has its own compiler, but the fundamental tasks performed by all
compilers remain the same. Translating source code into machine code involves multiple
stages, such as lexical analysis, syntax analysis, semantic analysis, code generation, and
optimization.
A compiler is a specialized kind of translator. In general, a translator or language
processor is a tool that converts an input program written in one programming language into
an equivalent program in another language.
Language Processing Systems
We know a computer is a logical assembly of Software and Hardware. The hardware knows a
language, that is hard for us to grasp, consequently, we tend to write programs in a high-level
language, that is much less complicated for us to comprehend and maintain in our thoughts.
Now, these programs go through a series of transformations so that they can readily be used
by machines. This is where language procedure systems come in handy.
Types of Compiler
Self/Native Compiler: When the compiler runs on the same machine and produces
machine code for the same machine on which it is running, it is called a self
compiler or resident compiler. Example: Turbo C or GCC. If a compiler runs on
a Windows machine and produces executable code for Windows, it is a native compiler.
Cross Compiler: A compiler may run on one machine and produce machine code
for another machine; in that case it is called a cross-compiler. It is capable of creating
code for a platform other than the one on which the compiler is running. Example: a
compiler that runs on a Linux/x86 box but builds a program that will run on a separate
Arduino/ARM board. If a compiler runs on a Linux machine and produces executable code for
Windows, it is a cross-compiler.
Source-to-Source Compiler: A Source-to-Source Compiler or transcompiler or transpiler
is a compiler that translates source code written in one programming language into the
source code of another programming language.
Single Pass Compiler: When all the phases of the compiler are present inside a single
module, it is simply called a single-pass compiler. It performs the work of converting
source code to machine code in one traversal.
Two Pass Compiler: A two-pass compiler is one in which the program is translated
twice: once by the front end and then again by the back end.
Multi-Pass Compiler: When several intermediate codes are created in a program and the
syntax tree is processed many times, it is called a multi-pass compiler. It breaks the code into
smaller pieces.
Just-in-Time (JIT) Compiler: It is a type of compiler that converts code into machine
language during program execution, rather than before it runs. It combines the benefits of
interpretation (real-time execution) and traditional compilation (faster execution).
Ahead-of-Time (AOT) Compiler: It converts the entire source code into machine code
before the program runs. This means the code is fully compiled during development,
resulting in faster startup times and better performance at runtime.
Incremental Compiler: It compiles only the parts of the code that have changed, rather
than recompiling the entire program. This makes the compilation process faster and more
efficient, especially during development.
Phases of a Compiler
There are two major phases of compilation, which in turn have many parts. Each of them
takes input from the output of the previous level and works in a coordinated way.
Analysis Phase
An intermediate representation is created from the given source code:
Lexical Analyzer
Syntax Analyzer
Semantic Analyzer
Intermediate Code Generator
Synthesis Phase
An equivalent target program is created from the intermediate representation. It has two parts:
Code Optimizer
Code Generator
Phases of Compiler
In a compiler, there are the following six phases:
Points to Remember
A token is a sequence of characters representing a lexical unit that matches a pattern, such as
an operator, identifier, or keyword.
A lexeme is a sequence of characters in the source program that matches the pattern for a
token. Simply, it is an instance of a token.
A pattern describes the rule that the lexemes of a token must follow. For a keyword token, the
pattern is the fixed sequence of characters forming the keyword; for identifiers, it is the set of
strings matched by the identifier rule.
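For example, in the C statement int count = 10; each lexeme is matched against a pattern and
classified as a token (the token names here are illustrative):
Lexeme   Token
int      KEYWORD (pattern: the fixed character sequence "int")
count    IDENTIFIER (pattern: a letter or underscore followed by letters, digits, or underscores)
=        OPERATOR
10       CONSTANT
;        SPECIAL SYMBOL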
Lexical analysis: This is the first phase of the compiler, which converts the high-level input
program into a sequence of tokens. These tokens are sequences of characters that are treated
as a unit in the grammar of the programming language. It can be implemented with a
deterministic finite automaton. The output is a sequence of tokens that is sent to the parser
for syntax analysis.
Syntactic analysis or Parsing: Also known as syntax analysis, this is the second phase,
where the provided input string is scanned for validation against the structure of the standard
grammar. The syntactic structure is analyzed and inspected to check whether the given input is
correct in terms of programming syntax.
Semantic analysis: In the third phase, the compiler checks whether the parse tree follows the
rules of the language. The compiler keeps track of identifiers and expressions. A semantic
analyzer checks the validity of the parse tree, and an annotated syntax tree is the output.
Intermediate code generation: Once the parse tree is semantically confirmed, an
intermediate code generator develops three address codes. During the time of the translation
of source program into object code, a middle-level language code is generated by the
compiler.
Code optimizer: This is an optional phase of compiler that is used for optimizing the
intermediate code. Through this optimization, the program runs fast and consumes less
space. To increase the program speed, unnecessary strings of code are eliminated and
sequences of statements are organized.
Code generation: This is the final phase of the compiler, where the compiler takes the fully
optimized intermediate code as input and maps it to the target machine code.
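As a quick worked illustration of how these phases cooperate, consider the classic textbook
statement position = initial + rate * 60 (the register names in the last step are illustrative,
not tied to a particular machine):
Lexical analysis:    <id,1>, <=>, <id,2>, <+>, <id,3>, <*>, <60>
Semantic analysis:   inserts a conversion inttofloat(60), since rate is a float
Intermediate code:   t1 = inttofloat(60)
                     t2 = id3 * t1
                     t3 = id2 + t2
                     id1 = t3
Code optimization:   t1 = id3 * 60.0
                     id1 = id2 + t1
Code generation:     LDF  R2, id3
                     MULF R2, R2, #60.0
                     LDF  R1, id2
                     ADDF R1, R1, R2
                     STF  id1, R1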
Compiler Passes
A compiler pass refers to one traversal of the compiler through the entire
program. Compiler passes are of two types: the single-pass compiler, and the two-pass
(or multi-pass) compiler. These are explained as follows.
Types of Compiler Pass
1. Single Pass Compiler
If we combine or group all the phases of compiler design in a single module, it is known as a
single-pass compiler.
In the above diagram, all six phases are grouped in a single module. Some points about
the single-pass compiler:
A one-pass/single-pass compiler is a type of compiler that passes through each part of the
compilation unit exactly once.
A single-pass compiler is faster and smaller than a multi-pass compiler.
A disadvantage of a single-pass compiler is that the code it produces is less efficient in
comparison with that of a multi-pass compiler.
A single-pass compiler processes the input exactly once, going directly from lexical
analysis to code generation, and then moving on to the next read.
Note: Pure single-pass compilation is almost never done today; early Pascal compilers
worked this way.
Problems with Single Pass Compiler
We cannot optimize very well, because the available context of expressions is limited.
As we cannot back up and process the input again, the grammar should be limited or simplified.
Command interpreters such as bash/sh/tcsh can be considered single-pass compilers, but
they also execute entries as soon as they are processed.
2. Two-Pass compiler or Multi-Pass compiler
A Two pass/multi-pass Compiler is a type of compiler that processes the source
code or abstract syntax tree of a program multiple times. In multi-pass Compiler, we divide
phases into two passes as:
Bootstrapping
It is an approach for making a self-compiling compiler, that is, a compiler written in the source
programming language that it is intended to compile. A bootstrap compiler can compile itself,
and we can then use the compiled compiler to compile everything else, including future
versions of itself.
Bootstrapping involves three languages:
1. Source Language
2. Target Language
3. Implementation Language
The bootstrapping procedure is:
1. Create a compiler SCAA for a subset S of the desired language L, written in language A, such
that the compiler runs on machine A.
2. Write a compiler LCSA for the full language L in the subset language S, producing code for
machine A.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L,
which runs on machine A and produces code for machine A.
Cross Compiler
A compiler is characterized by three languages as its source language, its object language, and
the language in which it is written. These languages may be quite different. A compiler can run
on one machine and produce target code for another machine. Such a compiler is known as a
cross-compiler.
Suppose we write a cross-compiler for a new language L in implementation language S that
generates code for machine N; denote it LSN.
If an existing compiler for S runs on machine M and generates code for M, it is denoted by
SMM.
If LSN is run through SMM, we get a compiler LMN, i.e., a compiler from L to N that runs on M.
Example: create the cross-compiler LMN by bootstrapping, i.e., by running LSN on SMM.
Finite State Machines
A finite automaton (FA) is specified by five components: a finite set of states Q, an input
alphabet ∑, a start state q0, a set of final states F, and a transition function
δ: Q x ∑ → Q
FA is characterized in two ways:
DFA
DFA stands for Deterministic Finite Automata. Deterministic refers to the uniqueness of the
computation. In DFA, the input character goes to one state only. DFA doesn't accept the null
move that means the DFA cannot change state without any input character.
Example
See an example of deterministic finite automata:
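For instance, consider a DFA over ∑ = {0, 1} that accepts exactly the strings ending in 1. It has
states Q = {q0, q1}, start state q0, and final state q1, with transition table:
State   0    1
q0      q0   q1
q1      q0   q1
Each (state, input) pair leads to exactly one next state, which is what makes the automaton
deterministic.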
NDFA
NDFA stands for Non-Deterministic Finite Automata. An NDFA is specified by the same five
components as a DFA, but it has a different transition function, which maps a state and an input
symbol to a set of states:
δ: Q x ∑ → 2^Q
Example
See an example of non deterministic finite automata:
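For instance, consider an NFA over ∑ = {0, 1} that accepts exactly the strings ending in 01. It
has states Q = {q0, q1, q2}, start state q0, and final state q2:
State   0          1
q0      {q0, q1}   {q0}
q1      { }        {q2}
q2      { }        { }
On input 0 in state q0, the automaton may move to q0 or to q1; this choice is what makes it
non-deterministic.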
Regular expression
o A regular expression is a sequence of characters that defines a pattern for a set of strings. It
is used to denote regular languages.
o It is also used to match character combinations in strings. String-searching algorithms use
such patterns to find operations on strings.
o In a regular expression, x* means zero or more occurrences of x. It can generate {ε, x, xx,
xxx, xxxx,.....}
o In a regular expression, x+ means one or more occurrences of x. It can generate {x, xx, xxx,
xxxx,.....}
Union: If L and M are two regular languages, then their union L U M is also a regular language.
L U M = {s | s is in L or s is in M}
Concatenation: If L and M are two regular languages, then their concatenation LM is also a
regular language.
LM = {st | s is in L and t is in M}
Kleene closure: If L is a regular language, then its Kleene closure L* is also a regular
language.
Example
Write the regular expression for the language:
L = {ab^n w : n ≥ 3, w ∈ (a + b)+}
Solution:
Each string of language L starts with "a" followed by at least three b's, and then contains at
least one more "a" or "b"; that is, the strings look like abbba, abbbbbba, abbbbbbbb, abbbb.....a
r = abbb b* (a + b)+
Example 1: Write the regular expression for the language accepting all combinations of a's,
over the set ∑ = {a}
Solution: All combinations of a's means 'a' may be zero, single, double and so on. If a is
appearing zero times, that means a null string. That is we expect the set of {ε, a, aa, aaa, ....}. So
we give a regular expression for this as:
R = a*
That is, the Kleene closure of a.
Example 2: Write the regular expression for the language accepting all combinations of a's
except the null string, over the set ∑ = {a}
Solution: The regular expression has to be built for the language L = {a, aa, aaa, ....}. This set
indicates that there is no null string. So we can denote regular expression as:
R = a+
Example 3: Write the regular expression for the language accepting all the strings
containing any number of a's and b's.
Solution: The regular expression will be:
R = (a + b)*
This gives the set {ε, a, b, aa, ab, ba, bb, aab, .....}: any combination of a's and b's, including
the null string.
Example 4: Write the regular expression for the language accepting all the strings which
start with 1 and end with 0, over ∑ = {0, 1}.
Solution: In a regular expression, the first symbol should be 1, and the last symbol should be 0.
The R.E. is as follows:
R = 1 (0+1)* 0
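A quick way to sanity-check such an expression is the POSIX regular-expression library in C.
The sketch below is illustrative and not part of the original notes: in POSIX notation the union
(0+1) is written [01], and ^ and $ anchor the whole string.
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* POSIX ERE equivalent of R = 1(0+1)*0 */
    regcomp(&re, "^1[01]*0$", REG_EXTENDED | REG_NOSUB);
    const char *tests[] = {"10", "1010", "100", "01", "1", "11"};
    for (int i = 0; i < 6; i++)
        printf("%-5s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accepted" : "rejected");
    regfree(&re);
    return 0;
}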
Example 5: Write the regular expression for the language of strings starting and ending
with a and having any combination of b's in between.
Solution: The regular expression will be:
R = a b* a
Example 6: Write the regular expression for the language starting with a but not having
consecutive b's.
Solution: The regular expression has to be built for the language: L = {a, aba, aab, abab, aaa,
aabab, .....}. The regular expression for the above language is:
R = {a + ab}*
Example 7: Write the regular expression for the language accepting all the string in which
any number of a's is followed by any number of b's is followed by any number of c's.
Solution: As we know, any number of a's means a*, any number of b's means b*, and any
number of c's means c*. Since, as given in the problem statement, b's appear after a's and c's
appear after b's, the regular expression could be:
R = a*b*c*
Regular expressions are fundamental in lexical analysis, the initial phase of a compiler that transforms
source code into tokens. They define patterns for valid tokens—such as keywords, identifiers, literals, and
operators—enabling the lexical analyzer to recognize and categorize these elements.
1. Token Definition: Regular expressions specify the structure of tokens. For example, an
identifier in many programming languages can be defined as a sequence starting with a
letter or underscore, followed by letters, digits, or underscores. This pattern can be
expressed as:
[a-zA-Z_][a-zA-Z0-9_]*
Example:
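A representative set of patterns is shown below; the token names are illustrative rather than
taken from any particular tool:
Regular expression           Token type
if | else | while | return   KEYWORD
[a-zA-Z_][a-zA-Z0-9_]*       IDENTIFIER
[0-9]+                       INTEGER LITERAL
==                           EQUALITY OPERATOR
;                            SEMICOLON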
In this example, each regular expression pattern corresponds to a specific token type. The lexical
analyzer uses these patterns to identify and categorize parts of the input source code.
Regular expressions are essential in lexical analysis for defining token patterns, generating finite
automata for recognition, and facilitating the use of tools that automate the creation of lexical
analyzers.
Optimization of DFA-Based Pattern Matchers
There are three algorithms that have been used to implement and optimize pattern matchers
constructed from regular expressions.
The first algorithm is useful in a Lex compiler, because it constructs a DFA directly from
a regular expression, without constructing an intermediate NFA. The resulting DFA also
may have fewer states than the DFA constructed via an NFA.
The second algorithm minimizes the number of states of any DFA, by combining states
that have the same future behavior. The algorithm itself is quite efficient, running in
time O(n log n), where n is the number of states of the DFA.
The third algorithm produces more compact representations of transition tables than the
standard, two-dimensional table.
When the NFA is constructed from a regular expression, we can say more about
the important states. The only important states are those introduced as initial states in the basis
part for a particular symbol position in the regular expression. That is, each important state
corresponds to a particular operand in the regular expression.
The constructed NFA has only one accepting state, but this state, having no out-
transitions, is not an important state. By concatenating a unique right endmarker # to a regular
expression r, we give the accepting state for r a transition on #, making it an important state of
the NFA for (r)#. In other words, by using the augmented regular expression (r)#, we can
forget about accepting states as the subset construction proceeds; when the construction is
complete, any state with a transition on # must be an accepting state.
The important states of the NFA correspond directly to the positions in the
regular expression that hold symbols of the alphabet. It is useful, as we shall see, to present the
regular expression by its syntax tree, where the leaves correspond to operands and the interior
nodes correspond to operators. An interior node is called a cat-node, or-node, or star-node if it is
labeled by the concatenation operator (dot), union operator |, or star operator *, respectively. We
can construct a syntax tree for a regular expression just as we did for arithmetic expressions
Example: Figure 3.56 shows the syntax tree for the regular expression (a|b)*abb# of our
running example. Cat-nodes are represented by circles.
Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we
attach a unique integer. We refer to this integer as the position of the leaf and also as a position
of its symbol. Note that a symbol can have several positions; for instance, a has positions 1 and 3
in Fig. 3.56. The positions in the syntax tree correspond to the important states of the constructed
NFA.
Example: Figure 3.57 shows the NFA for the same regular expression as Fig. 3.56, with the
important states numbered and other states represented by letters. The numbered states in the
NFA and the positions in the syntax tree correspond in a way we shall soon see.
To construct a DFA directly from a regular expression, we construct its syntax tree and then
compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each
definition refers to the syntax tree for a particular augmented regular expression (r)#.
1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented
by n has ε in its language. That is, the subexpression can be "made null" or the empty string,
even though there may be other strings it can represent as well.
2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol
of at least one string in the language of the subexpression rooted at n.
3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol
of at least one string in the language of the subexpression rooted at n.
4. followpos(p), for a position p, is the set of positions q in the entire syntax tree such that there
is some string x = a1a2...an in L((r)#) such that for some i, there is a way to explain the
membership of x in L((r)#) by matching ai to position p of the syntax tree and ai+1 to position q.
Example: Consider the cat-node n in Fig. 3.56 that corresponds to the expression (a|b)*a. We
claim nullable(n) is false, since this node generates all strings of a's and b's ending in an a; it
does not generate ε. On the other hand, the star-node below it is nullable; it generates ε along
with all other strings of a's and b's.
firstpos(n) = {1,2,3}. In a typical generated string like aa, the first position of the string
corresponds to position 1 of the tree, and in a string like ba, the first position of the string comes
from position 2 of the tree. However, when the string generated by the expression of node n is
just a, then this a comes from position 3.
lastpos(n) = {3}. That is, no matter what string is generated from the expression of node n, the
last position is the a from position 3 of the tree.
followpos is trickier to compute, but we shall see the rules for doing so shortly. Here is an
example of the reasoning: followpos(1) = {1,2,3}. Consider a string ...ac..., where the c is
either a or b, and the a comes from position 1. That is, this a is one of those generated by the a in
expression (a|b)*. This a could be followed by another a or b coming from the same
subexpression, in which case c comes from position 1 or 2. It is also possible that this a is the last
in the string generated by (a|b)*, in which case the symbol c must be the a that comes from
position 3. Thus, 1, 2, and 3 are exactly the positions that can follow position 1.
We can compute nullable, firstpos, and lastpos by a straightforward recursion on the height of
the tree. The basis and inductive rules for nullable and firstpos are summarized in Fig. 3.58. The
rules for lastpos are essentially the same as for firstpos, but the roles of children c1 and c2 must
be swapped in the rule for a cat-node.
Example: Of all the nodes in Fig. 3.56, only the star-node is nullable. We note from the table of
Fig. 3.58 that none of the leaves are nullable, because they each correspond to non-ε operands.
The or-node is not nullable, because neither of its children is. The star-node is nullable, because
every star-node is nullable. Finally, each of the cat-nodes, having at least one nonnullable child,
is not nullable.
The computation of firstpos and lastpos for each of the nodes is shown in
Fig. 3.59, with firstpos(n) to the left of node n, and lastpos(n) to its right. Each of the leaves has
only itself for firstpos and lastpos, as required by the rule for non-ε leaves in Fig. 3.58. For the
or-node, we take the union of firstpos at the
children and do the same for lastpos. The rule for the star-node says that we take the value of
firstpos or lastpos at the one child of that node.
Now, consider the lowest cat-node, which we shall call n. To compute firstpos(n), we first
consider whether the left operand is nullable, which it is in this case. Therefore, firstpos for n is
the union of firstpos for each of its children, that is {1,2} U {3} = {1,2,3}. The rule for lastpos
does not appear explicitly in Fig. 3.58, but as we mentioned, the rules are the same as for
firstpos, with the children interchanged. That is, to compute lastpos(n) we must ask whether its
right child (the leaf with position 3) is nullable, which it is not. Therefore, lastpos(n) is the same
as lastpos of the right child, or {3}.
4 Computing followpos
Finally, we need to see how to compute followpos. There are only two ways that a position of a
regular expression can be made to follow another.
1. If n is a cat-node with left child c1 and right child c2, then for every position i
in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
2. If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are
in followpos(i).
Example: Let us continue with our running example; recall that firstpos and lastpos were
computed in Fig. 3.59. Rule 1 for followpos requires that we look at each cat-node, and put each
position in firstpos of its right child in followpos for each position in lastpos of its left child. For
the lowest cat-node in Fig. 3.59, that rule says position 3 is in followpos(1) and followpos(2). The
next cat-node above says that 4 is in followpos(3), and the remaining two cat-nodes give us 5
in followpos(4) and 6 in followpos(5).
Figure 3.59: firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#
We must also apply rule 2 to the star-node. That rule tells us positions 1 and 2 are in both
followpos(1) and followpos(2), since both firstpos and lastpos for this node are {1,2}. The
complete followpos sets are summarized in Fig. 3.60.
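For reference, the complete followpos sets for (a|b)*abb# work out to:
n    followpos(n)
1    {1, 2, 3}
2    {1, 2, 3}
3    {4}
4    {5}
5    {6}
6    { }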
We can represent the function followpos by creating a directed graph with a node for each
position and an arc from position i to position j if and only if j is in followpos(i). Figure 3.61
shows this graph for the function of Fig. 3.60.
It should come as no surprise that the graph for followpos is almost an NFA without ε-transitions
for the underlying regular expression, and would become one if we:
1. Make all positions in firstpos of the root be initial states,
2. Label each arc from i to j by the symbol at position i, and
3. Make the position associated with endmarker # be the only accepting state.
INPUT: A regular expression r.
OUTPUT: A DFA D that recognizes L(r).
METHOD:
1. Construct a syntax tree T from the augmented regular expression (r)#.
2. Compute nullable, firstpos, lastpos, and followpos for T, using the methods of Sections 3.9.3
and 3.9.4.
3. Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D, by the
procedure of Fig. 3.62. The states of D are sets of positions in T. Initially, each state is
"unmarked," and a state becomes "marked" just before we consider its out-transitions. The start
state of D is firstpos(n0), where node n0 is the root of T. The accepting states are those
containing the position for the endmarker symbol #.
Example: We can now put together the steps of our running example to construct a DFA for the
regular expression r = (a|b)*abb. The syntax tree for (r)# appeared in Fig. 3.56. We observed
that for this tree, nullable is true only for the star-node, and we exhibited firstpos and lastpos in
Fig. 3.59. The values of followpos appear in Fig. 3.60.
The value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D. Call this
set of states A. We must compute Dtran[A, a] and Dtran[A, b]. Among the positions of A, 1 and
3 correspond to a, while 2 corresponds to b. Thus, Dtran[A, a] = followpos(1) U followpos(3) =
{1,2,3,4},
and Dtran[A, b] = followpos(2) = {1,2,3}. The latter is state A, and so does not have to be added
to Dstates, but the former, B = {1,2,3,4}, is new, so we add it to Dstates and proceed to compute
its transitions. The complete DFA is shown in Fig. 3.63.
The procedure of Fig. 3.62 referred to above is:
initialize Dstates to contain only the unmarked state firstpos(n0),
        where n0 is the root of syntax tree T for (r)#;
while ( there is an unmarked state S in Dstates ) {
    mark S;
    for ( each input symbol a ) {
        let U be the union of followpos(p) for all p
                in S that correspond to a;
        if ( U is not in Dstates )
            add U as an unmarked state to Dstates;
        Dtran[S, a] = U;
    }
}
Figure 3.62: Construction of a DFA directly from a regular expression
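Once Dtran is built, simulating the DFA is a simple table lookup per input character. Below is a
minimal C sketch for the DFA just constructed, with states A, B, C, D encoded as 0..3 (D, the
state containing the endmarker position, is accepting); the encoding and test strings are
illustrative:
#include <stdio.h>

/* Dtran for (a|b)*abb: rows = states A..D, columns = inputs a, b */
static const int dtran[4][2] = {
    /* A */ {1, 0},
    /* B */ {1, 2},
    /* C */ {1, 3},
    /* D */ {1, 0},
};

static int accepts(const char *s) {
    int state = 0;                       /* start state A */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;
        state = dtran[state][*s - 'a'];  /* one table lookup per symbol */
    }
    return state == 3;                   /* D is the accepting state */
}

int main(void) {
    const char *tests[] = {"abb", "aabb", "babb", "ab", "abba"};
    for (int i = 0; i < 5; i++)
        printf("%-5s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}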
There can be many DFA's that recognize the same language. For instance, note that the DFA's of
Figs. 3.36 and 3.63 both recognize the language L((a|b)*abb). Not only do these automata
have states with different names, but they don't even have the same number of states. If we
implement a lexical analyzer as a DFA, we would generally prefer a DFA with as few states as
possible, since each state requires entries in the table that describes the lexical analyzer.
The matter of the names of states is minor. We shall say that two automata are the same up to
state names if one can be transformed into the other by doing nothing more than changing the
names of states. Figures 3.36 and 3.63 are not the same up to state names. However, there is a
close relationship between the states of each. States A and C of Fig. 3.36 are actually equivalent,
in the sense that neither is an accepting state, and on any input they transfer to the same state —
to B on input a and to C on input b. Moreover, both states A and C behave like state 123 of Fig.
3.63. Likewise, state B of Fig. 3.36 behaves like state 1234 of Fig. 3.63, state D behaves like
state 1235, and state E behaves like state 1236.
It turns out that there is always a unique (up to state names) minimum state DFA for any regular
language. Moreover, this minimum-state DFA can be constructed from any DFA for the same
language by grouping sets of equivalent states. In the case of L((a|b)*abb), Fig. 3.63 is
the minimum-state DFA, and it can be constructed by partitioning the states of Fig. 3.36 as
{A, C}, {B}, {D}, {E}.
In order to understand the algorithm for creating the partition of states that converts any DFA
into its minimum-state equivalent DFA, we need to see how input strings distinguish states from
one another. We say that string x distinguishes state s from state t if exactly one of the states
reached from s and t by following the path with label x is an accepting state. State s is
distinguishable from state t if there is some string that distinguishes them.
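For example, in Fig. 3.36 the empty string distinguishes state E from states A, B, C and D, since
E is accepting and the others are not; likewise, the string b distinguishes D from A, because from
D the path labeled b reaches the accepting state E, while from A it reaches the non-accepting
state C.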
Lexical analysis is the first step of the compiler, which reads the source code one character at a
time and transforms it into an array of tokens. A token is a meaningful collection of characters
in a program. These tokens can be keywords such as do, if, while; identifiers such as
x, num, count; operator symbols such as >, >=, +; and punctuation symbols such as
parentheses or commas. The output of the lexical analyzer phase passes to the next
phase, called the syntax analyzer or parser.
The syntax analyser or parser implements the parsing phase. It takes tokens as input from the
lexical analysis phase and groups them together into syntactic structures. The
output of this phase is a parse tree.
It can separate tokens from the program and return those tokens to the parser as requested by it.
It can eliminate comments, whitespace, newline characters, etc. from the string.
It can insert tokens into the symbol table.
Lexical analysis returns an integer code for each token to the parser.
It strips out comments and whitespace (tab, newline, blank, and other characters that are
used to separate tokens in the input).
It correlates error messages produced by the compiler with the source program (for example,
by keeping track of line numbers).
It can implement the expansion of macros, if macro pre-processors are used in the
source code.
LEX generates a lexical analyzer as its output, taking a LEX program as its input. A LEX
program is a collection of patterns (regular expressions) and their corresponding actions.
The patterns represent the tokens to be recognized by the lexical analyzer to be generated. For
each pattern, a corresponding NFA will be designed.
Lexical Analysis: Lexical analysis, also known as scanning, is the first phase of a compiler.
It involves reading the source program character by character from left to right and
organizing the characters into tokens. Tokens are meaningful sequences of characters. There
are usually only a small number of token categories for a programming language, including
constants (such as integers, doubles, characters, and strings), operators (arithmetic,
relational, and logical), punctuation marks, and reserved keywords.
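The scanning loop itself can be written by hand. The following C sketch is illustrative only: the
token codes and the tiny keyword list are invented for the example, and a real scanner would
cover many more cases.
#include <stdio.h>
#include <ctype.h>
#include <string.h>

enum { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OTHER, TOK_EOF };

/* Read the next token from *src, store its lexeme, and advance the cursor. */
static int next_token(const char **src, char *lexeme) {
    const char *p = *src;
    int n = 0;
    while (isspace((unsigned char)*p)) p++;          /* skip whitespace */
    if (*p == '\0') { *src = p; return TOK_EOF; }
    if (isalpha((unsigned char)*p) || *p == '_') {   /* identifier or keyword */
        while (isalnum((unsigned char)*p) || *p == '_') lexeme[n++] = *p++;
        lexeme[n] = '\0'; *src = p;
        return (!strcmp(lexeme, "int") || !strcmp(lexeme, "return"))
                 ? TOK_KEYWORD : TOK_IDENT;
    }
    if (isdigit((unsigned char)*p)) {                /* integer constant */
        while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0'; *src = p;
        return TOK_NUMBER;
    }
    lexeme[0] = *p++; lexeme[1] = '\0'; *src = p;    /* operator or punctuation */
    return TOK_OTHER;
}

int main(void) {
    const char *src = "int a = 10;";
    char lexeme[64];
    int t;
    while ((t = next_token(&src, lexeme)) != TOK_EOF)
        printf("token code %d, lexeme '%s'\n", t, lexeme);
    return 0;
}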
Lexical Analysis
The lexical analyzer takes a source program as input, and produces a stream of tokens as
output.
Token
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the
programming languages.
Categories of Tokens
Keywords: In C programming, keywords are reserved words with specific meanings used
to define the language’s structure like if, else, for, and void. These cannot be used as
variable names or identifiers, as doing so causes compilation errors. C programming has a
total of 32 keywords.
Identifiers: Identifiers in C are names for variables, functions, arrays, or other user-
defined items. They must start with a letter or an underscore (_) and can include letters,
digits, and underscores. C is case-sensitive, so uppercase and lowercase letters are
different. Identifiers cannot be the same as keywords like if, else or for.
Constants: Constants are fixed values that cannot change during a program’s execution,
also known as literals. In C, constants include types like integers, floating-point numbers,
characters, and strings.
Operators: Operators are symbols in C that perform actions on variables or other data
items, called operands.
Special Symbols: Special symbols in C are compiler tokens used for specific purposes,
such as separating code elements or defining operations. Examples include ; (semicolon)
to end statements, , (comma) to separate values, {} (curly braces) for code blocks, and []
(square brackets) for arrays. These symbols play a crucial role in the program’s structure
and syntax.
Lexeme
A lexeme is an actual string of characters that matches a pattern and generates a token.
e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";".
Lexemes and Tokens Representation
Lexeme   Token
(        LPAREN
a        IDENTIFIER
b        IDENTIFIER
)        RPAREN
=        ASSIGNMENT
a        IDENTIFIER
2        INTEGER
;        SEMICOLON
For example, consider a printf statement such as printf("Hello");. There are 5 valid tokens in
this statement: printf, (, "Hello", ), and ;.
Exercise 1: Count number of tokens:
int main()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}
Answer: Total number of tokens: 27.
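Counted in order, the tokens are: int, main, (, ), {, int, a, =, 10, the comma, b, =, 20, ;, printf,
(, the string literal "sum is:%d", the comma, a, +, b, ), ;, return, 0, ;, and }, which is 27 in all.
Note that a string literal counts as a single token, and whitespace and comments do not count.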
Exercise 2: Count number of tokens:
int max(int i);
The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
max is then read and found to be a valid function name after reading (.
int is also a token; then i is another token, and finally ) and ;.
Answer: Total number of tokens is 7: int, max, (, int, i, ), ;