
Principles of Compiler Design

Chapter 01 – Introduction to
Compilers

PoCD Team
School of Computer Science & Engineering
2024 - 25
Chapter 01 Contents

Introduction to Compiler
• Why Compilers and History of compilers
• Language Processing system
• Translation Process
• Chomsky Hierarchy
• Different views of compiler
• Major Data structures in Compilers
• Lexical Analysis: Scanning Process
• Lexical Errors
• Regular Expression

"If you lie to the compiler, it will get its revenge."

1. Compiler
A compiler is a program that reads a program written in one language - the source language - and translates it into an equivalent program in another language - the target language.

2. Analysis of a source program (Language processing system)
Several other programs may be required to create an executable target program. A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes assigned to a distinct program, called a pre-processor. The pre-processor may also expand shorthands, called macros, into source language statements. Figure:2.1 shows a typical "compilation". The target program created by the compiler may require further processing before it can be run. The compiler creates assembly code that is translated by an assembler into machine code and then linked together with some library routines into the code that actually runs on the machine.


Figure:2.1

• The input to a compiler may be produced by one or more preprocessors, and further processing of the compiler's output may be needed before running machine code is obtained.

• Preprocessors produce input to compilers. They may perform the following functions:
Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
File inclusion: A preprocessor may include header files into the program text. For example, the C preprocessor causes the contents of the file global.h to replace the statement #include <global.h> when it processes a file containing this statement.

• Assembler – Some compilers produce assembly code that is passed to an assembler for further processing. Other compilers perform the job of the assembler themselves, producing relocatable machine code that can be passed directly to the loader/link-editor.

Assembly code is a mnemonic version of machine code, in which names are used
instead of binary codes for operations, and names are also given to memory addresses.

• Loader/Linker – A program called a loader performs the two functions of loading and link-editing. The process of loading consists of taking relocatable machine code, altering the relocatable addresses as shown in Figure:2.1, and placing the altered instructions and data in memory at the proper locations.
The link-editor allows us to make a single program from several files of relocatable machine
code. These files may have been the result of several different compilations, and one or more
may be library files of routines provided by the system and available to any program that
needs them. If the files are to be used together in a useful way, there may be some external
references, in which the code of one file refers to a location in another file. This reference may
be to a data location defined in one file and used in another, or it may be to the entry point of
a procedure that appears in the code for one file and is called from another file. The
relocatable machine code file must retain the information in the symbol table for each data
location or instruction label that is referred to externally. If we do not know in advance what
might be referred to, we in effect must include the entire assembler symbol table as part of
the relocatable machine code.

3. Compiler:
A compiler is software that converts a program from one language to another. A compiler takes as input a program written in its source language, which is a high-level language such as FORTRAN, C, or C++, and produces an equivalent output program written in its target language, which is machine language, sometimes called object code, consisting of machine instructions of the computer on which it is to be executed. This can be represented graphically as shown in Figure:3.1


Figure:3.1
Why compilers:
Initially, programs were written in machine language: numeric codes that represented the actual machine operations to be performed. For example,
C7 06 0000 0002
represents the instruction to move the value 2 to the location 0000. Writing such code is extremely time consuming and tedious, so this form of coding was replaced by assembly language, in which instructions and memory locations are given symbolic forms. For example,
MOV x, 2
An assembler translates the symbolic codes and memory locations of assembly language into the corresponding numeric codes of machine language. Assembly language greatly improved the speed and accuracy of programming, but it is still not easy to write, and it is difficult to read and understand. Assembly language is also extremely dependent on the particular machine for which it was written, so code written for one machine must be completely rewritten for another machine. Hence there was a need for a programming language that resembles mathematical notation or natural language, is independent of any one particular machine, and yet is capable of being translated by a program into executable code. For example, the previous instruction can be written in a concise, machine-independent form as
x = 2
The first machine-independent language, FORTRAN, and its compiler were developed by a team at IBM between 1954 and 1957, at a time when most of the processes involved in translating a programming language were not well understood. The study of natural languages by Noam Chomsky later made the construction of compilers considerably easier and even capable of partial automation.

Other programs used together with the compiler include interpreters, assemblers, linkers, loaders, pre-processors, editors, debuggers, profilers and project managers.

4. The translation process
A compiler consists of a number of phases, as shown in Figure: 4.1 below.

Figure :4.1


The scanner:
This phase reads the source program, which arrives as a stream of characters, and performs lexical analysis: it groups the sequence of characters into meaningful units called tokens. Consider the C language assignment statement
a[index] = 4 + 2
This statement contains 12 nonblank characters but only 8 tokens.
The tokens produced are shown in Table :4.1

Table :4.1
Along with generating tokens, the scanner also enters identifiers into the symbol table and literals into the literal table.
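As a rough illustration of what the scanner might emit for this statement, the sketch below defines a small token type in C; the enumeration names and struct layout are illustrative assumptions, not the handout's actual definitions.

#include <stdio.h>

/* Illustrative token kinds for the statement a[index] = 4 + 2. */
typedef enum {
    TOK_ID,        /* identifier, e.g. a, index  */
    TOK_LBRACKET,  /* [                          */
    TOK_RBRACKET,  /* ]                          */
    TOK_ASSIGN,    /* =                          */
    TOK_NUM,       /* numeric literal, e.g. 4, 2 */
    TOK_PLUS       /* +                          */
} TokenKind;

typedef struct {
    TokenKind kind;
    const char *lexeme;   /* the characters that make up the token */
} Token;

int main(void) {
    /* The 8 tokens of a[index] = 4 + 2, in the order the scanner would emit them. */
    Token tokens[] = {
        {TOK_ID, "a"}, {TOK_LBRACKET, "["}, {TOK_ID, "index"}, {TOK_RBRACKET, "]"},
        {TOK_ASSIGN, "="}, {TOK_NUM, "4"}, {TOK_PLUS, "+"}, {TOK_NUM, "2"}
    };
    for (int i = 0; i < 8; i++)
        printf("%d: %s\n", tokens[i].kind, tokens[i].lexeme);
    return 0;
}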
The parser:
The tokens generated from the source code by the scanner act as input to the parser, which performs syntax analysis and determines the structure of the program. This phase identifies the structural elements of the program and their relationships. The output of syntax analysis, shown in Figure:4.2, is represented as a parse tree or syntax tree. The above assignment statement consists of a subscripted expression on the left and an integer arithmetic expression on the right, and this structure can be represented as a parse tree in the following form.

Figure:4.2
The internal nodes of the parse tree are labelled by the names of the structures they represent, and the leaves of the parse tree represent the sequence of tokens from the input. The parse tree is an inefficient representation of this structure, so the parser usually also generates a syntax tree, also called an abstract syntax tree. The abstract syntax tree for the above assignment statement is shown in Figure :4.3

Figure:4.3

In the above syntax tree, many of the nodes have disappeared. For example, if we know that an expression is a subscript operation, then it is not necessary to keep the brackets [ and ].
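A syntax tree of this kind can be represented with a simple pointer-based node record; the node kinds and field names in the following C sketch are assumptions for illustration, not the handout's definitions.

/* A minimal syntax-tree node layout (the kinds and field names are assumptions). */
typedef enum { NODE_ASSIGN, NODE_SUBSCRIPT, NODE_PLUS, NODE_ID, NODE_NUM } NodeKind;

typedef struct TreeNode {
    NodeKind kind;
    struct TreeNode *left;   /* e.g. the subscripted expression a[index] */
    struct TreeNode *right;  /* e.g. the arithmetic expression 4 + 2     */
    const char *name;        /* lexeme, used when kind == NODE_ID        */
    int value;               /* literal value, used when kind == NODE_NUM */
} TreeNode;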
The semantic analyzer:
This phase of the compiler deals with the semantics, or meaning, of the program rather than its syntax or structure. Semantics usually determine runtime behavior, but the semantic analyzer analyzes the static features of a program that can be determined before execution and that cannot be conveniently expressed as syntax or analyzed by the parser, such as data types. Figure:4.4 is the tree produced by the semantic analyzer for the assignment statement, where it has attached meanings as shown.

Figure:4.4
The source code optimizer:
Compilers include code improvement or optimization steps. Some optimization is done after the semantic analysis phase, and many optimizations can be performed at the source level. Individual compilers vary not only in the kinds of optimization performed but also in the placement of the optimization phases. For the assignment statement we can perform a source-level optimization by having the compiler precompute the expression 4 + 2 to the result 6. This optimization is done directly on the syntax tree, as shown in Figure: 4.5, by collapsing the right-hand subtree of the root node to a constant value; this is called constant folding.
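A constant-folding pass over the syntax tree might look like the following; this is only a sketch built on the assumed TreeNode layout given earlier, not the handout's implementation.

#include <stdlib.h>

/* Fold NUM + NUM subtrees into a single NUM node (a sketch only). */
TreeNode *fold_constants(TreeNode *node) {
    if (node == NULL) return NULL;
    node->left  = fold_constants(node->left);
    node->right = fold_constants(node->right);
    if (node->kind == NODE_PLUS &&
        node->left != NULL && node->left->kind == NODE_NUM &&
        node->right != NULL && node->right->kind == NODE_NUM) {
        int result = node->left->value + node->right->value;  /* 4 + 2 becomes 6 */
        free(node->left);
        free(node->right);
        node->kind  = NODE_NUM;
        node->left  = NULL;
        node->right = NULL;
        node->value = result;
    }
    return node;
}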


Figure:4.5

Other optimizations can be performed on intermediate code, a form of code representation between source code and object code; common intermediate representations are three-address code and P-code. In the three-address scheme, each instruction refers to at most three addresses (memory locations), as shown in Figure: 4.6

Figure:4.6
The intermediate code is also called the intermediate representation, or IR.
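As a rough illustration, one plausible three-address rendering of the example statement is as follows (the temporary name t is an assumption for illustration, not taken from the handout's figure):
t = 4 + 2
a[index] = t
Each instruction refers to at most three addresses: the temporary t, the two constants, and the array element a[index].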


The code generator:

The code generator takes the IR as input and generates code for the target machine. In this phase the code generated depends on the target machine: instructions are selected from the target machine's instruction set, and decisions must also be made about data representation. The code is shown in the form of assembly language for easy understanding, as in Figure: 4.7, but some compilers generate object code directly.

Figure:4.7
The target code optimizer:
In this phase the compiler improves the target code generated by the code generator. For example, the improvements can be:
• Choosing addressing modes to improve performance

• Replacing slower instructions by faster ones

• Eliminating redundant or unnecessary operations.

For the target code generated above, a few improvements can be made: one is to use a shift operation instead of the multiplication, and another is to use a more powerful addressing mode, such as indexed addressing, to perform the array store. With these optimizations the target code becomes as shown in Figure :4.8

Figure:4.8

5. Major Data structures in a compiler

The compiler should implement its algorithms efficiently, without too much complexity. Ideally the compiler should be capable of compiling a program in time proportional to the size of the program, that is, in O(n) time, where n is the size of the program. Hence the phases of the compiler need data structures, both as part of their operation and to communicate information among the phases.
Tokens:
When the scanner recognizes a sequence of characters (a lexeme) as a token, it represents the token symbolically, i.e. as a value of an enumerated data type representing the set of tokens of the source language. Sometimes we also need to preserve the lexeme itself or other information derived from it, such as the name associated with an identifier token or the value of a number token. In most cases the scanner generates only one token at a time.
The Syntax Tree:
The syntax tree generated by the parser is constructed as a standard pointer-based structure allocated dynamically as parsing proceeds. The entire tree can then be kept as a single variable pointing to the root node. Each node in the structure is a record representing the information collected both by the parser and by the semantic analyzer.
The Symbol Table:
This data structure keeps information associated with identifiers: functions, variables, constants and data types. The symbol table interacts with almost every phase of the compiler: the scanner, parser or semantic analyzer enter identifiers into the table and add data type and other information, while the optimization and code generation phases use the information provided by the symbol table to make appropriate object-code choices. Since the symbol table is used frequently, the insert, delete and access operations need to be constant-time operations, and preferably a hash table is used for this purpose.
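A minimal sketch of such a hash-table symbol table in C is given below; the bucket count, the hash function and the stored attributes are assumptions for illustration, not the handout's design.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211   /* a prime bucket count; the exact size is an arbitrary choice */

/* One entry: identifier name plus an illustrative attribute (its type). */
typedef struct Symbol {
    const char *name;       /* assumed to outlive the table (not copied here) */
    const char *type;       /* e.g. "int", "float"                            */
    struct Symbol *next;    /* chaining resolves hash collisions              */
} Symbol;

static Symbol *table[TABLE_SIZE];

/* Simple string hash; real compilers use a variety of hash functions. */
static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert a name with its type; average-case O(1). */
void st_insert(const char *name, const char *type) {
    unsigned h = hash(name);
    Symbol *sym = malloc(sizeof(Symbol));
    sym->name = name;
    sym->type = type;
    sym->next = table[h];
    table[h] = sym;
}

/* Look a name up; returns NULL if it is not in the table; average-case O(1). */
Symbol *st_lookup(const char *name) {
    for (Symbol *s = table[hash(name)]; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}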
The Literal Table:
This table stores constants and strings used in the program. Quick insertion and lookup are essential, but the literal table need not support deletion, because its data applies globally to the program and a constant or string will appear only once in this table. The literal table reduces the size of the program in memory by allowing constants and strings to be reused. The literal table is also used by the code generator to construct symbolic addresses for literals and to enter data definitions in the target code file.
Intermediate code:
The intermediate code may be kept as an array of text strings, a temporary text file, or a linked list of structures, depending on the kind of representation (three-address code or P-code) and the kind of optimization performed.
Temporary files:
Historically, computers did not possess enough memory for an entire program to be kept in memory during compilation, so temporary files were used during the process to hold the results of intermediate steps.

6. Chomsky Hierarchy:
When the first compilers were under development, the findings of Noam Chomsky on the structure of natural languages made the construction of compilers easier and even capable of partial automation. Chomsky introduced a classification of languages according to the complexity of their grammars and the power of the algorithms needed to recognize them. The Chomsky hierarchy of languages consists of four levels of grammars, called type 0, type 1, type 2 and type 3 grammars, as shown in Figure :5.1. Type 2, or context-free, grammars are the standard way to represent the structure of programming languages. Regular expressions and finite automata correspond to type 3, or regular, grammars and are closely associated with type 2 grammars.


Figure :5.1
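As a brief illustration (these particular rules are not from the handout), a type 2 (context-free) rule can describe nested structure, for example
exp → ( exp ) | exp + exp | number
while a type 3 (regular) grammar is restricted to right-linear rules such as
bits → 0 bits | 1 bits | ℇ
which generates the same language as the regular expression (0|1)*.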

7. Different views of compiler:

Analysis and Synthesis
The operations that analyze the source program to compute its properties are classified as the analysis part of the compiler, while the operations involved in producing translated code are called the synthesis part of the compiler. Accordingly, lexical analysis, syntax analysis and semantic analysis belong to the analysis part, and code generation belongs to the synthesis part. Optimization belongs to both analysis and synthesis.
Front End and Back End
The operations that depend only on the source language are called the front end, and the operations that depend only on the target language are called the back end. The scanner, parser and semantic analyzer are part of the front end, while the code generator is part of the back end. Some optimizations are target dependent and so belong to the back end, whereas intermediate code generation is target independent and thus part of the front end. This structure is very important for compiler portability.

Passes:
The compiler may process the entire source program several times before generating code; each such repetition is referred to as a pass. Compilers can be one-pass or multi-pass, depending in part on the level of optimization performed.

8. Lexical Analysis: Scanning Process

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. The interaction between scanner and parser can be summarized schematically as shown in Figure: 8.1

Figure: 8.1
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token. The main functionality of lexical analysis is to read input characters and produce tokens, interacting with the symbol table as it does so. Its other functions are stripping out comments and white space from the source program, correlating error messages from the compiler with the source program, and macro expansion. Sometimes lexical analyzers are divided into a cascade of two phases, the first called scanning and the second lexical analysis proper. Some advantages of keeping these phases separate are:
1. Simplicity
2. Efficiency
3. Portability

Terminology:
Lexeme: A sequence of input characters that comprises a single token is called a lexeme.
Eg: float, total, =
Token: A lexeme identified as valid using predefined rules; the category it belongs to.
Eg: Identifier, String, Keyword
Pattern: The rule describing the set of strings (lexemes) that belong to a token.
The table below shows the difference between lexeme, token and pattern; a few tokens, lexemes and patterns are shown in Table :8.1

Table:8.1
Attributes for tokens:
The lexical analyzer must provide the subsequent phases of the compiler with additional information about the particular lexeme that matched. The lexical analyzer collects information about tokens into their associated attributes. Usually a token has only a single attribute: a pointer to the symbol table entry in which the information about the token is kept. For example, for a token of type identifier we may require both the lexeme and its associated line number; both pieces of information can be stored in the symbol table entry for the identifier.
Consider the statement E = M * C ** 2
The tokens and associated attribute values are as follows
<id, pointer to symbol table entry E>
<assign_op,>
<id, pointer to symbol table entry M>
<mul_op,>
<id, pointer to symbol table entry C>
<exp_op,>
<num, integer value 2>

9. Lexical errors:
There are mainly three types of compile-time errors: lexical, syntactic and semantic errors.
A lexical error is an input that is rejected by the scanner. Types of lexical errors are:
• a sequence of characters that cannot be scanned into any valid token;
• misspelling of identifiers, keywords or operators.
For example, suppose the string 'fi' is encountered for the first time in a C program as
fi(a==f(x))…..
The scanner cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token identifier, the scanner must return the token id to the parser, and it does this. In reality it is an error, but it is not identified by the scanner; in such cases one of the other phases of the compiler, probably the parser, recognizes the error.
The scanner uses a method called panic-mode error recovery whenever none of the patterns for tokens matches any prefix of the remaining input and the scanner is unable to proceed. In such cases, characters are deleted successively from the remaining input until the lexical analyzer can find a well-formed token at the beginning of the input that is left.
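A minimal sketch of such panic-mode recovery is shown below; it assumes a hypothetical helper that simply discards characters until one that could begin a token of the small example language appears (the set of start characters is an assumption).

#include <ctype.h>

/* Skip input characters until one that can start a token appears. */
const char *panic_recover(const char *input) {
    while (*input != '\0' &&
           !isalnum((unsigned char)*input) &&   /* identifier, keyword or number start */
           *input != '=' && *input != '+' &&    /* operators                           */
           *input != '[' && *input != ']' &&
           *input != '(' && *input != ')')
        input++;                                /* delete the offending character      */
    return input;
}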


Other error recovery techniques are:


1. Delete one character from remaining input

2. Insert a missing character into remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.

10. Regular Expression:
Basic Definitions:
Alphabets, Languages and Grammar
Alphabets are finite, non-empty sets of symbols, usually denoted by Σ

Examples:
∑ = {0, 1} , the binary alphabet
∑ = {a, b, c, ….z}, set of all lower case letters

Strings
• Strings are finite sequences of symbols chosen from the alphabet. Example: 0110 is a string from the binary alphabet ∑ = {0, 1}
• Strings are denoted by u, v, w, x, y, z. Example: u = 10110

Empty String
A string with zero occurrences of symbols. It is denoted by Є

Length of a string
The number of symbols in a string.
Notation: |w|
|00101| = 5
If u = 101 then |u| = 3
For the empty string, the length is 0.

Reverse of a string
If w = ab then the reverse of w is wᴿ = ba


Concatenation of strings
If w1 and w2 are two strings, then their concatenation is written w1.w2
w.Є = Є.w = w for all w
In general, w1.w2 ≠ w2.w1
The length of a concatenation is the sum of the lengths of the two strings: |w1.w2| = |w1| + |w2|

Powers of an alphabet
If ∑ = {a, b, c}
Then
∑0 = {Є}
∑1 = {a, b, c}
∑2 = {aa, ab, ac, ba, bb, bc, ca, cb, cc}
And so on…

∑* = ∑0 U ∑1 U ∑2 U ∑3 U ….
∑* = {a, b, c}* = {Є, a, b, c, ab, aa, abb, aab, …}
The closure ∑* contains Є, but the positive closure ∑+ does not:
∑+ = ∑* - {Є}
∑+ = ∑1 U ∑2 U ∑3 U ….
∑* = ∑+ U {Є}

Languages
A set of strings, all of which are chosen from some ∑*, where ∑ is a particular alphabet, is called a language.
If ∑ is an alphabet and L ⊆ ∑*, then L is a language over ∑.

Examples:
• The language of all strings consisting of n 0's followed by n 1's, for some n >= 0, is { Є, 01, 0011, 000111, … }
• The set of strings of 0's and 1's with an equal number of each is { Є, 01, 10, 0011, 0101, 1001, … }
• ∑* is a language for any alphabet ∑
• Ø, the empty language, is a language over any alphabet
• { Є }, the language consisting of only the empty string, is also a language over any alphabet. Note that Ø ≠ { Є }.


A regular expression r represents a pattern of strings of characters and is completely defined by the set of strings that it matches. This set is called the language generated by the regular expression and is written L(r). The language depends on the character set in use, generally the ASCII characters or some subset of them; the character set may also be generic, in which case its elements are referred to as symbols. This set of symbols is called the alphabet and is represented by the Greek symbol Σ. A regular expression may also consist of characters from the alphabet, but in that context they have a different meaning: they indicate patterns. (During the discussion the context will make clear which is meant.)
A regular expression may contain characters that have special meaning. Such characters are called metacharacters or metasymbols. A metacharacter may not be a legal character of the alphabet and otherwise needs to be distinguished from the ordinary characters of the alphabet using an escape character (which "turns off" the special meaning), for example a backslash or quotes.
First we describe the basic regular expressions, then the operations that generate new regular expressions from existing ones.
Basic Regular Expressions
1. Single characters from the alphabet, which match themselves.
Example: for a character a from Σ, the regular expression a matches the character a, and we write L(a) = {a}.
Special cases:
2. The empty string ℇ (epsilon), that is, the string containing no characters; the language generated is L(ℇ) = {ℇ}.
3. The empty set Φ, which matches no string at all; its language is the empty set, and we write L(Φ) = { }.

Regular Expression Operations


Three basic operations in regular expressions:
1. Choice among alternatives, indicated by the metacharacter |
2. Concatenation, indicated by juxtaposition (without metacharacter)
3. Repetition or closure, indicated by the metacharacter *


1. Choice among alternatives:

If r and s are regular expressions, then r|s is a regular expression which matches any string that is matched either by r or by s. The language generated by r|s is the union of the languages of r and s, i.e. L(r|s) = L(r) U L(s).
1. Example: the language generated by the choice a|b is L(a|b) = L(a) U L(b) = {a} U {b} = {a, b}
2. L(a|ℇ) = L(a) U L(ℇ) = {a} U {ℇ} = {a, ℇ}
3. L(a|b|c|d) = {a, b, c, d}
4. a|b|c|………|z matches any of the lowercase letters a through z

2. Concatenation:
The concatenation of two regular expressions r and s is written as rs, and it matches any string that is the concatenation of two strings, the first of which matches r and the second of which matches s.
1. The regular expression ab matches only the string ab, i.e. L(ab) = L(a)L(b) = {ab}
2. The regular expression (a|b)c matches the strings ac and bc (parentheses will be discussed in a later section), i.e. L((a|b)c) = L(a|b)L(c) = {a, b}{c} = {ac, bc}
Concatenation can be extended to more than two regular expressions.

3. Repetition: The repetition operation on a regular expression, also called Kleene closure, is written r*, where r is a regular expression. r* matches any finite concatenation of strings, each of which matches r, i.e.
a* matches ℇ, a, aa, aaa, aaaa, ……
For a set of strings S, the * operation can be defined as
S* = {ℇ} U S U SS U SSS U ……
where Sn denotes the concatenation of S with itself n times, and S0 = {ℇ}.
For a regular expression r we can therefore write
L(r*) = L(r)*
Example:
(a|bb)* matches ℇ, a, bb, aa, abb, bba, aaa, bbbb, aabb, and so on.
In terms of languages, L((a|bb)*) = L(a|bb)* = {a, bb}* = {ℇ, a, bb, aa, abb, bba, aaa, bbbb, aabb, bbaa, aabba, ……}

Precedence of operations and use of parentheses:

The operator * has the highest precedence of the regular expression operators, concatenation has the next highest, and | has the lowest.
Example:
a|b* is interpreted as a|(b*)
a|bc* is interpreted as a|(b(c*))
If we want to indicate a different precedence, we must use parentheses.
Example:
(a|b)c indicates that the choice operation should be performed before the concatenation.
Note: The use of parentheses is entirely analogous to their use in arithmetic.

Name for Regular Expression:


It is often useful to give a name to a longer regular expression.
Example: the regular expression for a sequence of one or more numeric digits,
(0|1|2|3|4……|9) (0|1|2|3|4……|9)*
can be represented as
digit digit*
where digit = 0|1|2|3…|9 is a regular definition of the name digit.

Note:
The same language can be generated by many different regular expressions; we generally try to find a small and efficient one.
Not all sets of strings that we can describe can be generated by regular expressions.

C programming language tokens fall into the following categories:

1. Keywords - if, while, do
2. Special symbols - arithmetic operators, assignment and equality operators (single character, such as =, or multiple characters, such as == or ++)
3. Identifiers - sequences of letters and digits beginning with a letter
4. Literals or constants - numeric constants, string literals, characters

Example:
1. Unsigned decimal integer:
digit=0|1|2|3|4|5|6|7|8|9
digit digit*

2. Signed Integer:
digit=0|1|2|3|4|5|6|7|8|9
sign=+|-|ℇ
SignedInteger=sign digit digit*

3. Keywords:
Key_word=int | float | char


More Examples:
4. KLETU student three-digit roll number (ranging from 001 to 999)
d1= 0|1|2|3|4|5|6|7|8|9
d2= 1|2|3|4|5|6|7|8|9
Roll_no=00d2 | 0d2d1 |d2d1d1

5. Student Registration Number of the format 01FE19BCS154

SRN=01FE19BCS Roll_no (where Roll_no is taken from the previous example)
Suppose we need to generate the SRN for Civil and Mechanical with codes CV and ME respectively; then the regular expression becomes
SRN=01FE19B(CS|CV|ME) Roll_no

6. Identifiers:
letter=a| b| c| d|………|z| A|B|C|………. |Z
digit=0 | 1 |2 | 3 |………. | 9
Identifier=letter(letter|digit) *

7. Set of all strings containing exactly one b over the alphabet Σ={a,b,c}
(a|c) *b (a|c) *
8. Set of all strings ending with abb over the alphabet Σ={a,b}
(a|b)*abb
9. Set of all strings starting with 0 and ending with 1 over the alphabet Σ= {0,1}
0(0|1) *1
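As a rough sketch of how such a pattern might be checked in practice, the following C program uses the POSIX regex library; the character-class form [a-zA-Z][a-zA-Z0-9]* is assumed as the POSIX equivalent of the identifier pattern letter(letter|digit)*.

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* POSIX extended form of letter(letter|digit)*; ^ and $ anchor the whole string. */
    const char *pattern = "^[a-zA-Z][a-zA-Z0-9]*$";
    const char *tests[] = { "total", "x2", "2x", "rate!" };

    regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    for (int i = 0; i < 4; i++)
        printf("%-6s : %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "identifier"
                                                       : "not an identifier");
    regfree(&re);
    return 0;
}

Running this would report total and x2 as identifiers and reject 2x and rate!.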

Regular expressions with extended operators

Using only the basic regular expression operators, writing some regular expressions is complicated, so we require operators that are more expressive. In the following discussion we describe some extensions to the standard regular expressions along with their corresponding metasymbols.


One or more repetitions

The regular expression r* allows r to be repeated zero or more times. Sometimes we need one or more repetitions instead, disallowing ℇ. Consider a natural number, where we require a sequence of digits but want at least one digit to appear. To match a binary number, the regular expression (0|1)* also matches the empty string, so the expression has to be written as
(0|1)(0|1)*
For this reason a standard notation has been developed: the regular expression r+ matches one or more repetitions of r. Thus the previous regular expression can be written as (0|1)+.
Any character
To match any single character of the alphabet we use the period ".", which avoids having to list every character of the alphabet as an alternative.
The regular expression that matches all strings containing at least one b is:
.*b.*
A range of characters
A range of characters, such as the lowercase letters or all digits, can be written as
a|b|…|z or 0|1|2|…|9 respectively.
Instead of this representation we can use square brackets and a hyphen. This notation is called a character class.
E.g.: [a-z] represents the lowercase letters and [0-9] represents the digits.
a|b|c can be written as [abc]
[a-zA-Z] represents all lowercase and uppercase letters.
Writing [A-z] will not match the same characters as [a-zA-Z], because in ASCII the range A-z also includes the characters [, \, ], ^, _ and ` that lie between Z and a.
Any character not in a given set
Excluding a single character from the set of characters to be matched is achieved by the "not" or complement operation on a set of alternatives, for which we use the tilde character ~.

E.g.: ~a matches any character other than a

~(a|b|c) matches any character other than a, b or c
Optional subexpressions
Optional parts of a string, which may or may not appear in any particular string, can be represented using alternatives as follows:
digit=[0-9]+
SignedInteger = digit | +digit | -digit
This expression is verbose, so we instead use the ? metacharacter to mark a subexpression as optional. The expression above can then be written as
SignedInteger = (+|-)? digit
RE for C tokens and real-world examples using extended operators
Examples:

1. Signed Integers:
digit=[0-9]+
sign=+|-
SignedInteger=(sign)? digit

2. Identifiers:
letter=[a-zA-Z]
digit=[0-9]
Identifier=letter(letter|digit)*

3. if and for
keyword=if | for


4. Regular expression for the language of strings starting and ending with a and having any combination of b's in between:
R = a b* a

5. Assume we would like our password to contain all of the following, but in no particular order:
At least one digit [0-9]
At least one lowercase character [a-z]
At least one uppercase character [A-Z]
At least one special character from * . ! @ # $ % ^ & ( ) { } [ ] : ; < > , ? / ~ _ + - = | \
At least 8 characters in length, but no more than 32.
Using look-ahead assertions, as supported by many practical regular expression libraries:
(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[*.!@#$%^&(){}\[\]:;<>,.?/~_+\-=|\\]).{8,32}

6. The valid pin code of India must satisfy the following conditions.
It can be only six digits.
It should not start with zero.
First digit of the pin code must be from 1 to 9.
Next five digits of the pin code may range from 0 to 9.
It should allow only one white space, but after three digits.
[1-9]{1}[0-9]{2}\s{0,1}[0-9]{3}

~*~*~*~*~*~*~*~*~*~*~*~
