
KKR & KSR INSTITUTE OF TECHNOLOGY & SCIENCES

UNIT I
Introduction: Overview of compilation, Language Processors, The structure of a Compiler, Pass
and Phases of translation, Interpretation and bootstrapping.

Lexical Analysis: The Role of the Lexical Analyzer, Input Buffering, Recognition of Tokens,
Design of a Lexical-Analyzer Generator, Optimization of DFA-Based Pattern Matchers, The Lexical-
Analyzer Generator (LEX) tool.

1.1.1 OVERVIEW OF LANGUAGE PROCESSING SYSTEM


A computer is a blend of hardware and software, where hardware executes
instructions as binary code (0s and 1s). Writing in binary is complex, so programmers use
high-level languages, which are easier to understand and remember. These programs
are processed by operating system components and devices to generate machine-
readable code. This system of converting high-level language to binary is called a
language processing system.

1.1.1.1 PREPROCESSOR
A pre-processor produces input to compilers. It may perform the following functions.

 Macro processing: A pre-processor may allow a user to define macros that are
shorthands for longer constructs.
 File inclusion: A pre-processor may include header files into the program text.
 Rational pre-processor: These pre-processors augment older languages with
more modern flow-of-control and data-structuring facilities.
 Language Extensions: These pre-processors attempt to add capabilities to the
language in the form of built-in macros.
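
For instance, a small C fragment showing macro processing and file inclusion (the macro name SQUARE is only an illustrative choice):

/* Before pre-processing: the pre-processor inserts the text of stdio.h and
   expands the macro wherever it is used. */
#include <stdio.h>                  /* file inclusion */
#define SQUARE(x) ((x) * (x))       /* macro: shorthand for a longer construct */

int main(void) {
    printf("%d\n", SQUARE(5));      /* expands to ((5) * (5)) before compilation */
    return 0;
}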

1.1.1.2 COMPILER
A compiler is a translator program that takes a program written in a high-level
language (HLL), the source program, and translates it into an equivalent program in
machine-level language (MLL), the target program. An important part of a compiler is
reporting errors in the source program to the programmer.

Executing a program written in an HLL programming language basically consists of two parts:
the source program must first be compiled, i.e., translated into an object program; then
the resulting object program is loaded into memory and executed.

1.1.1.3 ASSEMBLER
Programmers found it difficult to write or read programs in machine language.
They began to use mnemonics (symbols) for each machine instruction, which they
would subsequently translate into machine language. Such a mnemonic machine
language is now called an assembly language. Programs known as assemblers were
written to automate the translation of assembly language into machine language.
The input to an assembler is called the source program; the output is its
machine language translation (object program).

1.1.1.4 INTERPRETER:
An interpreter is a program that appears to execute a source program as if it
were machine language.

Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. JAVA
also uses an interpreter. The process of interpretation can be carried out in the following
phases:

 Lexical analysis
 Syntax analysis
 Semantic analysis
 Direct Execution
Advantages
 Modifications to the user program can easily be made and applied as execution
proceeds.
 The type that an object denotes may change dynamically.
 Debugging a program and finding errors is a simpler task for an interpreted
program.
 The interpreter for the language makes it machine independent.

Disadvantages
 The execution of the program is slower.
 Memory consumption is higher.
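
To make the line-by-line model above concrete, here is a minimal sketch (an illustrative assumption, not a full interpreter) that reads one statement per line of the form <number> <operator> <number>, executes it immediately, and stops at the first erroneous line:

/* Minimal line-by-line interpreter sketch. */
#include <stdio.h>

int main(void) {
    char line[128];
    double a, b;
    char op;
    int lineno = 0;

    while (fgets(line, sizeof line, stdin)) {
        lineno++;
        if (line[0] == '\n')
            continue;                     /* skip empty lines                      */
        /* "Lexical + syntax analysis" of one line at a time. */
        if (sscanf(line, "%lf %c %lf", &a, &op, &b) != 3) {
            printf("line %d: syntax error, execution stopped\n", lineno);
            return 1;                     /* the interpreter halts where the error is */
        }
        /* "Direct execution" of the recognized statement. */
        switch (op) {
        case '+': printf("%g\n", a + b); break;
        case '-': printf("%g\n", a - b); break;
        case '*': printf("%g\n", a * b); break;
        case '/': printf("%g\n", a / b); break;
        default:
            printf("line %d: unknown operator '%c'\n", lineno, op);
            return 1;
        }
    }
    return 0;
}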
1.1.1.5 Loader and Link-editor:
Once the assembler produces an object program, that program must be placed
into memory and executed. The assembler could place the object program directly in
memory and transfer control to it, thereby causing the machine language program to be
executed. However, this would waste core by leaving the assembler in memory while the user's
program was being executed. Also, the programmer would have to retranslate the program
with each execution, thus wasting translation time. To overcome these problems of wasted
translation time and memory, system programmers developed another component called a
loader.

"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into an object form that
the loader could "relocate" directly behind the user's program. The task of adjusting
programs so they may be placed in arbitrary core locations is called relocation. Relocating
loaders perform four functions.
1.1.2 STRUCTURE OF THE COMPILER DESIGN

Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated
operation that takes the source program in one representation and produces output in another
representation. The phases of a compiler are shown below. There are two parts of
compilation:

a. Analysis (Machine Independent/Language Dependent)

b. Synthesis (Machine Dependent/Language independent)

1.1.2.1 PHASES OF A COMPILER

The compilation process is partitioned into a number of sub-processes called 'phases'.


Lexical Analysis:-

The LA or scanner reads the source program one character at a time, carving the source
program into a sequence of atomic units called tokens.

Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase
expressions, statements, declarations, etc. are identified by using the results of lexical
analysis. Syntax analysis is aided by using techniques based on the formal grammar of the
programming language.

Intermediate Code Generation:-


An intermediate representation of the final machine language code is produced. This
phase bridges the analysis and synthesis phases of translation.

Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output
runs faster and takes less space.

Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of machine language program are carried out during this phase. The
output of the code generator is the machine language program of the specified computer.
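
As an illustration (an assumed textbook-style example), the assignment position = initial + rate * 60 might pass through the phases above roughly as follows:

position = initial + rate * 60;

/* Lexical analysis (token stream):
       id1 = id2 + id3 * 60
   Syntax analysis: a tree with = at the root, id1 as its left subtree, and
       the tree for id2 + (id3 * 60) as its right subtree
   Intermediate code (three-address code):
       t1 = inttofloat(60)
       t2 = id3 * t1
       t3 = id2 + t2
       id1 = t3
   Code optimization:
       t1 = id3 * 60.0
       id1 = id2 + t1
   Code generation (for an assumed register machine):
       MOVF id3, R2
       MULF #60.0, R2
       MOVF id2, R1
       ADDF R2, R1
       MOVF R1, id1                                                      */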

Table Management (or) Book-keeping:-


This is the portion that keeps track of the names used by the program and records essential
information about each. The data structure used to record this information is called a "Symbol
Table".

Error Handlers:-
It is invoked when a flaw (error) in the source program is detected. The output of LA is
a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The SA
groups the tokens together into syntactic structures called expressions. Expressions may
further be combined to form statements. The syntactic structure can be regarded as a tree
whose leaves are the tokens; such trees are called parse trees.

1.1.3 Phases in Compiler:

Generally, phases are divided into two parts:

1.1.3.1Front End phases:

The front end consists of those phases, or parts of phases, that are source-language
dependent and target-machine independent. These generally consist of lexical analysis,
syntax analysis, semantic analysis, symbol-table creation, and intermediate code
generation. A small part of code optimization can also be included in the front-end part.
The front end also includes the error handling that goes along with each of these
phases.

1.1.3.2 Back End phases:

The portions of the compiler that depend on the target machine and do not depend on
the source language are included in the back end. The back end includes code generation and
the necessary parts of the code optimization phase, along with the associated error handling
and symbol-table operations.

1.1.4 Passes in Compiler:

A pass is a component in which parts of one or more phases of the compiler are
combined when the compiler is implemented. A pass reads or scans the instructions of the
source program, or the output produced by the previous pass, and makes the necessary
transformations specified by its phases.

There are generally two types of passes

1. One-pass
2. Two-pass

Grouping

Several phases are grouped together into a pass so that it can read an input file and write an
output file.

1. One-Pass – In a one-pass compiler, all the phases are grouped into a single pass; all six
phases are included in that one pass.
2. Two-Pass – In a two-pass compiler, the phases are divided into two parts, i.e., the analysis
or front-end part of the compiler and the synthesis or back-end part of the compiler.

1. Purpose of One Pass Compiler

A one-pass compiler generates machine instructions as it scans the source program as a
stream of instructions, and adds the machine addresses of these instructions to a list of
instructions to be back-patched once the machine address for each is generated. It
processes the program a single time: as each source line is handled, it is checked
and its tokens are extracted.

2. Purpose of Two-Pass Compiler

A two-pass compiler uses its first pass to enter into its symbol table a list of
identifiers along with the memory locations to which these identifiers refer. Then, in a
second pass, it replaces mnemonic operation codes by their machine-language
equivalents and replaces uses of identifiers by their machine addresses. In the second pass, the
compiler can read the result file produced by the first pass, build the syntax
tree, and perform the syntactic analysis. The result of this stage is a file that contains
the syntax tree.

1.1.5 Interpretation

An interpreter converts a high-level programming language into machine code line by line,
unlike compilers and assemblers. It executes instructions immediately, stopping when
errors occur and allowing easier error correction. However, interpreted programs are slower
to execute than compiled ones and require the source code to run every time.

Interpreters were first used in 1952, making programming simpler given the limitations of
early computers. They are commonly used in micro-computers and help programmers
debug errors before moving to the next statement.

A self-interpreter is an interpreter written in the same language it interprets, like a BASIC
interpreter written in BASIC. Examples of languages with elegant self-interpreters include
Lisp and Prolog.

Advantages and Disadvantages of Interpreters

 The advantage of an interpreter is that the program is executed line by line, which helps
users find errors easily.
 The disadvantage of an interpreter is that the program takes more time to execute
than compiled code.

Applications of Interpreters

 Each operator executed in a command language is usually an invocation of a
complex routine, such as an editor or compiler, so interpreters are frequently used for
command languages and glue languages.
 Virtualization is often used when the intended architecture is unavailable.
 Sand-boxing
 Self-modifying code can be easily implemented in an interpreted language.
 Emulators for running computer software written for obsolete and unavailable
hardware on more modern equipment.

1.1.6 Bootstrapping

Bootstrapping is a key technique in compiler design, where a basic compiler is used to
build and improve more advanced versions of itself. It enables the development of
compilers for new programming languages and enhances existing ones over time.

 It relies on self-compiling compilers, with each iteration improving their ability to
handle complex code.
 The process simplifies the development cycle, supporting incremental improvements
and quicker deployment of robust compilers.
 Many programming languages, such as C and Java, have successfully employed
bootstrapping techniques during their development.

Bootstrapping is the process of building compilers through iterative development:

1. Start with a Basic Compiler: A simple compiler is created using a basic language
(e.g., assembly language). It handles essential features of a programming language.
2. Create an Advanced Version: The basic compiler is used to compile a more advanced
version, which can handle additional features like better error checking and
optimizations.
3. Gradually Improve: Each version of the compiler builds on the previous one, adding
more features and improving efficiency. This process continues until the desired
result is achieved.

In the T-diagram:

1. Step 1: The source language is a subset of C (C0), the target language is Assembly,
and the implementation language is also Assembly.
2. Step 2: Using the C0 compiler, a compiler for the full C language is created, with C as
the source language and Assembly as the target language.
Cross compilation
Cross-compilation is a process where a compiler runs on one platform (host) but
generates machine code for a different platform (target). This is useful when the target
platform is not powerful enough to run the full compiler or when the target architecture is
different from the host system. Using bootstrapping in cross-compilation can help create a
compiler that runs on one system (the host) but produces code for another system (the
target).

For instance, to create a cross-compiler for language X generating code for Z, an
existing compiler for language Y (on machine M) is used to build a compiler XYZ, written in Y,
that translates X code to Z code. Compiling XYZ with the Y compiler on M produces XMZ, a
cross-compiler that runs on M but generates code for Z. This approach allows the creation of
a compiler for language X without directly needing the target system.

Advantages of Bootstrapping:

1. Improved Efficiency: Speeds up compiler development by using basic compilers to
create advanced ones.
2. Portability: Allows compilers to work across various systems, making them more
flexible.
3. Reduced Dependency: Minimizes reliance on external tools or software by enabling
self-sufficient compiler creation.

Challenges of Bootstrapping:

1. Initial Effort: Requires significant time and effort to build the first simple compiler.
2. Complexity of Self-Compilation: Ensuring the compiler can compile itself while
supporting advanced features is challenging.
3. Time Consumption: Iterative improvements in early stages are slow and resource-
intensive.
Lexical Analysis
1.2.1 OVERVIEW OF LEXICAL ANALYSIS

1. To identify the tokens we need some method of describing the possible tokens
that can appear in the input stream. For this purpose we introduce regular
expressions, a notation that can be used to describe essentially all the tokens of a
programming language.

2. Secondly, having decided what the tokens are, we need some mechanism to
recognize these in the input stream. This is done by the token recognizers, which
are designed using transition diagrams and finite automata.

1.2.2 ROLE OF LEXICAL ANALYZER

The LA is the first phase of a compiler. Its main task is to read the input
characters and produce as output a sequence of tokens that the parser uses for syntax
analysis.

Upon receiving a 'get next token' command from the parser, the lexical
analyzer reads the input characters until it can identify the next token. The LA returns to
the parser a representation for the token it has found. The representation will be an
integer code if the token is a simple construct such as a parenthesis, comma or colon.

The LA may also perform certain secondary tasks at the user interface. One such
task is stripping out from the source program comments and white space in the
form of blank, tab and newline characters. Another is correlating error messages from
the compiler with the source program.

LEXICAL ANALYSIS VS PARSING:

TOKEN, LEXEME, PATTERN:

Token: A token is a sequence of characters that can be treated as a single logical
entity. Typical tokens are:

1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants

Pattern: A set of strings in the input for which the same token is produced as
output. This set of strings is described by a rule called a pattern associated with the
token.

Lexeme: A lexeme is a sequence of characters in the source program that is matched
by the pattern for a token.
Example:
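For instance, in the C statement

    if (count == 10)

the lexeme if matches the pattern for the keyword if, the lexeme count matches the identifier
pattern letter (letter | digit)* and yields the token id, the lexeme == matches the pattern for the
relational operator, and the lexeme 10 matches the pattern for a constant (num). (The identifier
name count is only an illustrative choice.)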
A pattern is a rule describing the set of lexemes that can represent a particular token in the
source program.

LEXICAL ERRORS:
Lexical errors are the errors thrown by your lexer when it is unable to continue. This
means that there is no way to recognise a lexeme as a valid token for your lexer. Syntax
errors, on the other hand, will be thrown by your parser when a given set of already
recognised valid tokens does not match any of the right-hand sides of your grammar rules. A
simple panic-mode error-handling system requires that we return to a high-level parsing
function when a parsing or lexical error is detected.

Error-recovery actions are:

1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.

1.2.3 INPUT BUFFERING

Input buffering is a critical concept in compiler design that improves the efficiency of
reading and processing source code. Typically, a compiler scans the input one character at a
time, which can be slow and inefficient. Input buffering addresses this issue by allowing the
compiler to read chunks of input data into a buffer before processing them. This reduces the
number of system calls, each of which carries overhead, thereby improving performance.

A buffer is essentially a temporary storage area where a block of input data is
loaded. The size of this buffer can vary depending on the compiler's specific needs and the
type of source code being compiled. For instance, compilers for high-level programming
languages might use larger buffers, as these languages often have longer lines of code, while
compilers for low-level languages may use smaller buffers.

One major advantage of input buffering is its ability to reduce the frequency of
system calls needed to read the source code, leading to faster compilation times.
Additionally, it simplifies the compiler's design by minimizing the amount of code required
for input management.

However, input buffering is not without its challenges. If the buffer size is excessively
large, it can consume too much memory, potentially leading to slower performance or even
crashes, especially on systems with limited resources. Furthermore, improper management
of the buffer can result in errors during compilation, such as incorrect processing of the
input data.

Initially both the pointers point to the first character of the input string as shown below

In the process of lexical analysis, the forward pointer (fp) scans the input to identify the end
of a lexeme. When a blank space is encountered, it signifies the end of the current lexeme
(e.g., recognizing the lexeme "int"). The fp then moves ahead, skipping the white space,
while both the begin pointer (bp) and fp are reset to the starting position of the next token.

However, reading characters directly from secondary storage is resource-intensive and
inefficient. To address this, the buffering technique is employed. A block of data is initially
loaded into a buffer, reducing the number of direct accesses to secondary storage. The
lexical analyzer then processes characters from the buffer instead.

There are two primary methods used in input buffering:

1. One Buffer Scheme: This approach uses a single buffer to hold the input data. It is
simpler but may require extra effort to manage overlapping lexemes.
2. Two Buffer Scheme: This method employs two buffers alternately. While one buffer
is being processed, the other is being filled with the next block of input, enabling
seamless processing and reducing delays caused by input operations.

One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long it crosses the buffer
boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites
the first part of the lexeme.
Two Buffer Scheme: The Two Buffer Scheme improves input buffering by using two
alternating buffers to store input. When one buffer is processed, the other is filled with the
next block of data, ensuring uninterrupted processing. Initially, both the begin pointer (bp)
and forward pointer (fp) point to the first character of the first buffer. The fp moves right to
find the end of a lexeme, which is marked by a blank space. The lexeme is identified as the
string between bp and fp.

To mark buffer boundaries, a Sentinel (end-of-buffer character) is placed at the end of each
buffer. When fp encounters the first sentinel, the second buffer is filled. Similarly,
encountering the second sentinel prompts refilling of the first buffer. This process continues
until all input is processed. A limitation of this method is that lexemes longer than the buffer
size cannot be fully scanned. Despite this, the scheme efficiently reduces secondary storage
access delays.
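
A minimal C sketch of this two-buffer scheme (the buffer size N and the use of '\0' as the sentinel are assumptions; lexeme bookkeeping and error handling are omitted):

#include <stdio.h>

#define N 4096                          /* size of each buffer half */

static char buf[2 * N + 2];             /* two halves, each followed by a sentinel slot */
static char *forward;                   /* the forward (fp) pointer */
static FILE *src;

/* Read up to N characters into one half and place the sentinel right after them. */
static void fill(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';                     /* '\0' acts as the end-of-buffer sentinel */
}

void init_buffers(FILE *f) {
    src = f;
    fill(buf);                          /* load the first half: buf[0..N-1], sentinel at buf[N] */
    forward = buf;
}

/* Advance fp and return the next character, refilling the other half at a sentinel. */
int advance(void) {
    int c = (unsigned char)*forward++;
    if (c != '\0')
        return c;                                /* ordinary character */
    if (forward == buf + N + 1) {                /* sentinel at the end of the first half  */
        fill(buf + N + 1);                       /* refill the second half and continue    */
        return advance();
    }
    if (forward == buf + 2 * N + 2) {            /* sentinel at the end of the second half */
        fill(buf);                               /* refill the first half and wrap around  */
        forward = buf;
        return advance();
    }
    return EOF;                                  /* sentinel inside a half: real end of input */
}

int main(void) {
    init_buffers(stdin);
    for (int c = advance(); c != EOF; c = advance())
        putchar(c);                              /* echo the input one character at a time */
    return 0;
}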

1.2.4 Recognition of Tokens

 Tokens obtained during lexical analysis are recognized by Finite Automata.


 Finite Automata (FA) is a simple idealized machine that can be used to recognize
patterns within input taken from a character set or alphabet (denoted as C). The
primary task of an FA is to accept or reject an input based on whether the defined
pattern occurs within the input.
 There are two notations for representing Finite Automata. They are:

1. Transition Table
2. Transition Diagram

1.2.4.1 Transition Table


It is a tabular representation that lists all possible transitions for each state and input
symbol combination.

EXAMPLE
Assume the following grammar fragment to generate a specific language

Where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions,

 where letter and digit are defined as (letter → [A-Z a-z] & digit → [0-9])
 For this language, the lexical analyzer will recognize the keywords if, then, and else,
as well as lexemes that match the patterns for relop, id, and number.
 To simplify matters, we make the common assumption that keywords are also
reserved words: that is, they cannot be used as identifiers.
 num represents the unsigned integer and real numbers of Pascal.
 In addition, we assume lexemes are separated by white space, consisting of nonnull
sequences of blanks, tabs, and newlines.
 Our lexical analyzer will strip out white space. It will do so by comparing a string
against the regular definition ws, below.

 If a match for ws is found, the lexical analyzer does not return a token to the parser.
 It is the following token that gets returned to the parser.
1.2.4.2 Transition Diagram
It is a directed labelled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.
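
As a small illustration of how such a table drives recognition in code, the following C sketch recognizes identifiers matching letter (letter | digit)* (the table layout, state numbering and function names are assumptions for illustration):

/* Transition table: state 0 = start, state 1 = in identifier (accepting), -1 = reject.
       state | letter | digit | other
       ------+--------+-------+------
         0   |   1    |  -1   |  -1
         1   |   1    |   1   |  -1                                          */
#include <ctype.h>
#include <stdio.h>

enum { LETTER, DIGIT, OTHER };                 /* character classes */

static int char_class(int c) {
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

static const int move[2][3] = {
    /* state 0 */ { 1, -1, -1 },
    /* state 1 */ { 1,  1, -1 },
};

/* Return 1 if the whole string s is a valid identifier, 0 otherwise. */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        state = move[state][char_class((unsigned char)*s)];
        if (state < 0) return 0;               /* dead state: reject */
    }
    return state == 1;                         /* accept only in state 1 */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("rate"), is_identifier("x60"),
           is_identifier("60x"));              /* prints: 1 1 0 */
    return 0;
}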

1.2.5 STRUCTURE OF THE GENERATED ANALYZER


 The program that serves as the lexical analyzer includes a fixed program that
simulates an automaton; at this point we leave open whether that automaton is
deterministic or nondeterministic.
 The rest of the lexical analyzer consists of components that are created from the Lex
program by Lex itself.
 These components are:
 A transition table for the automaton.
 Those functions that are passed directly through Lex to the output.
 The actions from the input program, which appear as fragments of code to be
invoked at the appropriate time by the automaton simulator.
ARCHITECTURE OF THE GENERATED ANALYZER

Pattern Matching Based on NFA's


 In this architecture, the transition table is that of a nondeterministic finite automaton (NFA).
 We begin by taking each regular-expression pattern in the Lex program and converting
it to an NFA.
 We need a single automaton that will recognize lexemes matching any of
the patterns in the program, so we combine all the NFAs into one by introducing a
new start state with ε-transitions to each of the start states of the NFAs Ni for
pattern pi.

The simulator reads input beginning at the point in its input which we have referred to as
lexemeBegin. As it moves the pointer called forward ahead in the input, it calculates the set of
states it is in at each point.
Eventually, the NFA simulation reaches a point on the input where there are no next states. At that
point, there is no hope that any longer prefix of the input would ever get the NFA to an
accepting state; rather, the set of states will always be empty. Thus, we are ready to decide on the
longest prefix that is a lexeme matching some pattern.
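
A compact C sketch of this set-of-states simulation, for two assumed patterns p1 = "ab" and p2 = "a+" combined through a new start state (the state numbering, pattern priorities and names below are illustrative assumptions):

/* Combined NFA: start state 0 with epsilon-moves to 1 and 4.
       p1 = "ab" : 1 -a-> 2 -b-> 3   (state 3 accepts p1)
       p2 = "a+" : 4 -a-> 5, 5 -a-> 5 (state 5 accepts p2)                   */
#include <stdio.h>

#define NSTATES 6

static const unsigned eps[NSTATES] = { (1u << 1) | (1u << 4), 0, 0, 0, 0, 0 };

static unsigned delta(int s, char c) {
    switch (s) {
    case 1: return c == 'a' ? 1u << 2 : 0;
    case 2: return c == 'b' ? 1u << 3 : 0;
    case 4: return c == 'a' ? 1u << 5 : 0;
    case 5: return c == 'a' ? 1u << 5 : 0;
    default: return 0;
    }
}

/* Close a state set (bitmask) under epsilon-transitions. */
static unsigned eclose(unsigned set) {
    unsigned old;
    do {
        old = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != old);
    return set;
}

/* Scan the longest lexeme starting at in[0]; return the matched pattern (1 or 2)
   and its length through *len, or 0 if nothing matches.                      */
static int scan(const char *in, int *len) {
    unsigned set = eclose(1u << 0);
    int best_pat = 0, best_len = 0;
    for (int i = 0; in[i] && set; i++) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) next |= delta(s, in[i]);
        set = eclose(next);
        if (set & (1u << 3)) { best_pat = 1; best_len = i + 1; }   /* p1 listed first wins */
        else if (set & (1u << 5)) { best_pat = 2; best_len = i + 1; }
    }
    *len = best_len;
    return best_pat;
}

int main(void) {
    int len, pat = scan("aab", &len);
    printf("pattern %d, lexeme length %d\n", pat, len);   /* pattern 2, length 2 */
    return 0;
}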

DFA'S FOR LEXICAL ANALYZERS

Another architecture, resembling the output of Lex, is to convert the NFA for all the patterns into an
equivalent DFA, using the subset construction method.
The accepting states are labelled by the pattern that is identified by that state.

LOOKAHEAD OPERATOR
 The Lex lookahead operator / in a Lex pattern r1/r2 is sometimes necessary, because
the pattern r1 for a particular token may need to describe some trailing context r2 in
order to correctly identify the actual lexeme.
 When converting the pattern r1/r2 to an NFA, we treat the / as if it were ε, so we do
not actually look for a / in the input. However, if the NFA recognizes a prefix xy of
the input buffer as matching this regular expression, the end of the lexeme is not where the
NFA entered its accepting state.
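
For example, the Fortran keyword IF, which is followed by a parenthesized condition and then a letter, can be described with a pattern of roughly the form

    IF / \( .* \) {letter}

where only the characters matching IF form the lexeme (the exact regular definition of letter is assumed here).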
AN NFA FOR THE PATTERN FOR THE FORTRAN IF WITH LOOKAHEAD

Notice that the ε-transition from state 2 to state 3 represents the lookahead operator. State
6 indicates the presence of the keyword IF. However, we find the lexeme IF by scanning
backwards to the last occurrence of state 2 whenever state 6 is entered.
1.2.6 Optimization of DFA-Based Pattern Matchers
To optimize the DFA, you have to follow these steps:

Step 1: Remove all the states that are unreachable from the initial state via any sequence of
transitions of the DFA.

Step 2: Draw the transition table for all the remaining states.

Step 3: Now split the transition table into two tables T1 and T2. T1 contains all final states
and T2 contains non-final states.

Step 4: Find similar rows in T1 such that:

1. δ (q, a) = p
2. δ (r, a) = p

That means, find two states which have the same transitions on each input symbol (here a and b) and remove one of them.

Step 5: Repeat Step 4 until no similar rows remain in the transition table T1.

Step 6: Repeat Steps 4 and 5 for table T2 as well.

Step 7: Now combine the reduced T1 and T2 tables. The combined transition table is the
transition table of minimized DFA.

Solution:
Step 1: In the given DFA, q2 and q4 are the unreachable states so remove them.

Step 2: Draw the transition table for the rest of the states.



Step 3:

Now divide the rows of the transition table into two sets:

1. One set contains those rows which start from non-final states:

2. The other set contains those rows which start from final states.

Step 4: Set 1 has no similar rows, so set 1 stays the same.

Step 5: In set 2, row 1 and row 2 are similar, since q3 and q5 transition to the same states on 0
and 1. So remove q5 and then replace q5 by q3 in the rest.

Step 6: Now combine set 1 and set 2 as:


This is the transition table of the minimized DFA.

Transition diagram of minimized DFA:
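
A short C sketch of the row-merging procedure described in the steps above (the example DFA encoded in the code is an assumed illustration, not the DFA of this worked example):

/* Repeatedly merge states that have the same finality and identical rows in the
   transition table, until no two rows are alike.                              */
#include <stdio.h>

#define NS 4           /* states q0..q3 */
#define NA 2           /* inputs 0 and 1 */

int delta[NS][NA] = {  /* assumed example DFA over {0,1} */
    {1, 2},            /* q0 */
    {1, 3},            /* q1 */
    {1, 2},            /* q2: same row and same finality as q0 */
    {3, 3},            /* q3 */
};
int is_final[NS] = { 0, 0, 0, 1 };
int rep[NS];           /* rep[s] = state that s was merged into */

int main(void) {
    for (int s = 0; s < NS; s++) rep[s] = s;

    int changed = 1;
    while (changed) {                       /* repeat until no similar rows remain */
        changed = 0;
        for (int q = 0; q < NS; q++) {
            if (rep[q] != q) continue;      /* already merged away */
            for (int r = q + 1; r < NS; r++) {
                if (rep[r] != r || is_final[q] != is_final[r]) continue;
                int same = 1;
                for (int a = 0; a < NA; a++)
                    if (rep[delta[q][a]] != rep[delta[r][a]]) same = 0;
                if (same) {                 /* rows agree: merge r into q */
                    rep[r] = q;
                    changed = 1;
                }
            }
        }
    }

    printf("minimized transition table (surviving states only):\n");
    for (int q = 0; q < NS; q++) {
        if (rep[q] != q) continue;
        printf("q%d%s : 0 -> q%d, 1 -> q%d\n", q, is_final[q] ? "*" : " ",
               rep[delta[q][0]], rep[delta[q][1]]);
    }
    return 0;
}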

1.2.7 The Lexical-Analyzer Generator (LEX) tool

LEX in compiler design is a tool that generates lexical analyzers, which are programs that
convert streams of characters into meaningful units called tokens. This process is known as
tokenization and is a key part of lexical analysis, the first phase of a compiler's workflow.

Key Features of Lexical Analysis:

1. Tokenization: Converts input character streams into tokens like identifiers,
separators, keywords, constants, operators, etc.
2. Error Detection: Identifies lexical errors, such as unmatched strings or length
violations.
3. Comment Elimination: Removes spaces, blank lines, and comments for simplified
processing.

What is Lex? Lex is a specialized tool (or program) that automates the generation of lexical
analyzers. It takes input in the form of Lex source programs (File.l) and produces C programs
(lex.yy.c) as output. The generated C program can be compiled using a standard C compiler,
resulting in a lexical analyzer (a.out), which converts character streams into tokens.

Functions of Lex:

1. Takes File.l (written in Lex syntax) as input and generates lex.yy.c.


2. Compiling lex.yy.c with a C compiler produces an executable (a.out), which performs
token generation.
3. Works in conjunction with YACC (Yet Another Compiler Compiler) for parser
generation.
Originally created by Mike Lesk and Eric Schmidt, Lex remains widely used for compiler
design due to its efficiency in simplifying lexical analysis tasks.

A Lex file is structured into three main sections, separated by %% delimiters:

1. Declarations: This section includes the declaration of variables, constants, or other
necessary components used in the Lex file.
2. Translation Rules: This part consists of patterns and their corresponding actions.
Patterns define the input to be matched, and actions specify what to do when a
pattern is found.
3. Auxiliary Procedures: This section contains auxiliary functions that support the
actions specified in the translation rules.
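
A minimal Lex specification illustrating these three sections (the token codes and patterns are assumptions for illustration); running lex on it produces lex.yy.c, which is then compiled with a C compiler as described above:

%{
/* Declarations section: C code and constants used by the actions below. */
#include <stdio.h>
#define ID   1
#define NUM  2
%}

digit   [0-9]
letter  [A-Za-z]

%%
 /* Translation rules: pattern  { action } */
[ \t\n]+                      { /* skip white space */ }
{letter}({letter}|{digit})*   { printf("id: %s\n", yytext);  return ID;  }
{digit}+                      { printf("num: %s\n", yytext); return NUM; }
.                             { printf("unknown: %s\n", yytext); }
%%
 /* Auxiliary procedures */
int yywrap(void) { return 1; }

int main(void) {
    while (yylex())
        ;                     /* keep asking for tokens until end of input */
    return 0;
}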
