PCD Notes - Unit - 1
Unit I Syllabus : Unit I : LEXICAL ANALYSIS Introduction to Compiling - Compilers - Analysis of the source program - The phases - Cousins - The grouping of phases - Compiler construction tools. The role of the lexical analyzer - Input buffering - Specification of tokens - Recognition of tokens - A language for specifying lexical analyzer.
Compiler :
A Compiler is a program that reads a program written in one language (the Source Language, such as C or C++) and translates it into an equivalent program in another language (the Target Language, such as Machine Language). The Compiler also reports to its user the presence of errors in the source program.

Source Program (High Level Language) -> Compiler -> Target Program (Low Level Language)
                                           |
                                           v
                                     Error Messages

Classification of Compiler :
1. Single Pass Compiler (narrow) - traverses the source program only once. Faster, has limited scope of passes, eg. Pascal
2. Multi-Pass Compiler (wide) - processes the source program several times. Slower, has wide scope of passes, eg. Java
3. Load and Go Compiler - generates machine code and then immediately executes it.
4. Debugging or Optimizing Compiler - tries to minimize or maximize some attribute of an executable program, such as its running time or size.
Software Tools :
Many software tools that manipulate source programs first perform some kind of analysis. Some examples of such tools include:
Structure Editors : A structure editor takes as input a sequence of commands to build a source program. The structure editor not only performs the text-creation and modification functions of an ordinary text editor, but it also analyzes the program text, putting an appropriate hierarchical structure on the source program. Example : while ... do and begin ... end.
Pretty printers : A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible.
Anna University B.E -VI Sem CSE CS2352 Principles of Compiler Design
Static Checkers : A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program.
Interpreters : Instead of producing a target program as a translation, an interpreter performs the operations implied by the source program, which is written in a high level language such as BASIC. Interpreters are frequently used to execute command languages, since each operator executed in a command language is usually an invocation of a complex routine such as an editor or Compiler.
The analysis portion in each of the following examples is similar to that of a conventional Compiler.
1. Text formatters.
2. Silicon Compiler.
3. Query interpreters.
Semantic analysis : This phase checks the source program for semantic errors and gathers type information for the subsequent code generation phase. An important component of semantic analysis is type checking. Example : int to real conversion.
[Figure: parse-tree fragment for rate * 60, in which the identifier rate and the number 60 are expressions and the number 60 is placed under an inttoreal node.]
Phases of Compiler:
A Compiler operates in phases, each of which transforms the source program from one representation to another. Compilation has two parts (six phases). They are,
Analysis Phase (Three Phases)
    Lexical Analysis
    Syntax Analysis
    Semantic Analysis
Synthesis Phase (Three Phases)
    Intermediate Code Generation
    Code Optimizer
    Code Generator
Two other activities are
    Symbol Table Management
    Error Handler
Lexical Analysis : The lexical analyzer is also called the scanner. The lexical analysis phase reads the characters in the source program and groups them into tokens, which are sequences of characters having a collective meaning, such as an identifier, a keyword, a punctuation character, an operator or a multi-character operator like ++. The character sequence forming a token is called the lexeme for the token. Certain tokens are augmented by a lexical value.
Example :
position := initial + rate * 60
id1 := id2 + id3 * 60
Blanks are eliminated.
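The lexical analysis example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names TOKEN_SPEC, tokenize and render are hypothetical, the patterns cover just the characters used in the example, and identifiers are numbered in order of first appearance to mimic id1, id2, id3.

```python
import re

# Hypothetical token patterns covering only what the example statement needs.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),          # blanks are recognized and then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table) for one statement."""
    symbol_table = {}            # lexeme -> index, in order of first appearance
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # The remaining characters do not form any token: a lexical error.
            raise SyntaxError(f"no token starts at position {pos}: {source[pos]!r}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "ID":
            index = symbol_table.setdefault(lexeme, len(symbol_table) + 1)
            tokens.append(("id", index))
        else:
            tokens.append((kind.lower(), lexeme))
    return tokens, symbol_table


def render(tokens):
    """Print identifiers as id1, id2, ... and every other token as its lexeme."""
    return " ".join(f"id{value}" if kind == "id" else value for kind, value in tokens)


tokens, table = tokenize("position := initial + rate * 60")
print(render(tokens))
print(table)
```

Running the sketch on the statement from the notes prints the token stream in the id1 := id2 + id3 * 60 form and shows which lexeme each identifier number stands for.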
Syntax analysis: It processes the string of descriptors (tokens) synthesized by the lexical analyzer to determine the syntactic structure of an input statement. This process is known as parsing. The output of the parsing step is a representation of the syntactic structure of a statement. A convenient representation is in the form of a syntax tree.
Example : position := initial + rate * 60

      :=
     /  \
  id1    +
        / \
     id2   *
          / \
       id3   60

Semantic analysis : This phase checks the source program for semantic errors and gathers type information for the subsequent code generation phase. An important component of semantic analysis is type checking.
Example : int to real conversion.

      :=
     /  \
  id1    +
        / \
     id2   *
          / \
       id3   inttoreal
                |
                60
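The two trees above can be reproduced with a small recursive-descent parser followed by a conversion pass. This is a minimal sketch, not the method of any particular compiler: the function names, the nested-tuple tree format and the tiny grammar (only + and *) are assumptions made for illustration, and the semantic pass simply assumes that every identifier is real, so every integer literal is wrapped in inttoreal.

```python
def parse_assignment(tokens):
    """Parse  id := expr  where expr uses + and * with the usual precedence."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                       # factor -> id | number
        tok = advance()
        return ("num", int(tok)) if tok.isdigit() else ("id", tok)

    def term():                         # term -> factor { * factor }
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():                         # expr -> term { + term }
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = ("id", advance())
    if advance() != ":=":
        raise SyntaxError("expected :=")
    tree = (":=", target, expr())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def add_conversions(node):
    """Semantic pass (assumption: all identifiers are real): wrap integer literals."""
    if node[0] == "num":
        return ("inttoreal", node)
    if node[0] == "id":
        return node
    return (node[0],) + tuple(add_conversions(child) for child in node[1:])


syntax_tree = parse_assignment("id1 := id2 + id3 * 60".split())
print(syntax_tree)
print(add_conversions(syntax_tree))
```

Because term is called from expr, the multiplication ends up lower in the tree than the addition, which is the shape drawn in the notes.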
Intermediate Code Generation: The intermediate representation should have two properties: it should be easy to produce, and it should be easy to translate into the target program. Three address code consists of a sequence of instructions, each of which has at most three operands.
Example : id1 := id2 + id3 * 60
Three address code :
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Code Optimization: This phase attempts to improve the intermediate code, so that faster running machine code will result.
Example : Three address code after optimization :
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation: The final phase of the Compiler is the generation of target code, consisting of relocatable machine code or assembly code.
Example : Target code in an assembly-like notation using registers R1 and R2 :
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
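The three address code and its optimized form can be derived mechanically from the annotated syntax tree. The sketch below is illustrative only: the nested-tuple tree, the (dest, op, arg1, arg2) instruction format and the function names are assumptions, and optimize performs just the two improvements shown in the example (converting the constant once at compile time and dropping the final copy).

```python
import itertools

# Assumed tree format: nested tuples, as a parser might produce them.
TREE = (":=", ("id", "id1"),
        ("+", ("id", "id2"),
         ("*", ("id", "id3"), ("inttoreal", ("num", 60)))))


def gen_three_address(tree):
    """Return a list of (dest, op, arg1, arg2) instructions for an assignment tree."""
    code = []
    temps = (f"temp{i}" for i in itertools.count(1))

    def walk(node):
        kind = node[0]
        if kind in ("id", "num"):
            return str(node[1])
        if kind == "inttoreal":
            operand = walk(node[1])
            dest = next(temps)
            code.append((dest, "inttoreal", operand, None))
            return dest
        left, right = walk(node[1]), walk(node[2])
        dest = next(temps)
        code.append((dest, kind, left, right))
        return dest

    _, target, value = tree
    code.append((target[1], "copy", walk(value), None))
    return code


def optimize(code):
    """Fold inttoreal of a constant, drop a trailing copy, renumber the temporaries."""
    consts, out = {}, []
    for dest, op, a, b in code:
        a, b = consts.get(a, a), consts.get(b, b)
        if op == "inttoreal" and a.isdigit():
            consts[dest] = str(float(a))        # "60" becomes "60.0"
            continue
        out.append((dest, op, a, b))
    # "x := tempN" right after the instruction computing tempN can assign to x directly.
    if len(out) >= 2 and out[-1][1] == "copy" and out[-2][0] == out[-1][2]:
        target = out.pop()[0]
        _, op, a, b = out.pop()
        out.append((target, op, a, b))
    names = {}

    def rename(x):
        if isinstance(x, str) and x.startswith("temp"):
            return names.setdefault(x, f"temp{len(names) + 1}")
        return x

    return [(rename(d), op, rename(a), rename(b)) for d, op, a, b in out]


def show(code):
    """Format instructions in the notation used in the notes."""
    lines = []
    for dest, op, a, b in code:
        if op == "copy":
            lines.append(f"{dest} := {a}")
        elif op == "inttoreal":
            lines.append(f"{dest} := inttoreal({a})")
        else:
            lines.append(f"{dest} := {a} {op} {b}")
    return lines


print("\n".join(show(gen_three_address(TREE))))
print("\n".join(show(optimize(gen_three_address(TREE)))))
```

The first listing corresponds to the unoptimized three address code and the second to the optimized version; a real optimizer would of course apply many more transformations than these two.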
Symbol Table Management: A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table. However, the attributes of an identifier cannot normally be determined during lexical analysis. The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
Error Handler: Each phase can encounter errors.
1. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language.
2. The syntax analysis phase can detect errors where the token stream violates the structure rules of the language.
3. During semantic analysis, the compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.
4. An intermediate code generator may detect an operator whose operands have incompatible types.
5. The code optimizer, doing control flow analysis, may detect that certain statements can never be reached.
6. While entering information into the symbol table, the book keeping routine may discover an identifier that has been multiply declared with contradicting attributes.
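A symbol table with one record per identifier can be sketched as a dictionary of dictionaries. The class and method names below are hypothetical; the sketch only shows the division of labour described above: the lexical analyzer enters the name, later phases fill in attributes, and a redeclaration with a contradicting attribute is reported as an error.

```python
class SymbolTable:
    """One record (a dict of attributes) per identifier."""

    def __init__(self):
        self._records = {}

    def enter(self, name):
        """Called by the lexical analyzer: create the record if it does not exist yet."""
        return self._records.setdefault(name, {"name": name})

    def set_attribute(self, name, key, value):
        """Called by later phases; a contradicting redeclaration raises an error."""
        record = self.enter(name)
        if key in record and record[key] != value:
            raise ValueError(
                f"{name!r} declared with contradicting {key}: {record[key]!r} and {value!r}"
            )
        record[key] = value

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self._records.get(name)


table = SymbolTable()
table.enter("rate")                          # lexical analysis: only the name is known
table.set_attribute("rate", "type", "real")  # a later phase adds the type
print(table.lookup("rate"))
```

Calling set_attribute("rate", "type", "int") afterwards would raise ValueError, which models the multiply-declared identifier case from the error handler list.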
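Write down the output of each phase for the expression position := initial + rate * 60

Source Program
position := initial + rate * 60

Lexical Analyzer
id1 := id2 + id3 * 60

Syntax Analyzer

      :=
     /  \
  id1    +
        / \
     id2   *
          / \
       id3   60

Semantic Analyzer

      :=
     /  \
  id1    +
        / \
     id2   *
          / \
       id3   inttoreal
                |
                60

Intermediate Code Generator
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1

Code Generator
Target Program
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

The Symbol Table Manager and the Error Handler interact with all of the phases.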
Cousins of the Compiler :
Preprocessors : A preprocessor produces input to the Compiler. It may perform the following functions.
1. Macro Processing : A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion : A preprocessor may include header files into the program text.
3. Rational preprocessors : These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions : These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.
Compiler : It converts the source program (HLL) into the target program (LLL).
Assembler : It converts an assembly language program (LLL) into machine code.
Loader and Link Editors :
Loader : The process of loading consists of taking relocatable machine code, altering the relocatable addresses and placing the altered instructions and data in memory at the proper locations.
Link Editor : It allows us to make a single program from several files of relocatable machine code.
Grouping of Phases :
Role of the Lexical Analyzer :
The main task of the lexical analyzer is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis. On receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token.
Its secondary tasks are,
1. Stripping out comments and white space (blank, tab and newline characters) from the source program.
2. Correlating error messages from the compiler with the source program.
The lexical analyzer is sometimes divided into two phases
1. Scanning
2. Lexical analysis
The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the more complex operations.
FUNCTIONS:
1. It produces the stream of tokens.
2. It eliminates blanks and comments.
3. It generates the symbol table, which stores information about the identifiers and constants encountered in the input.
4. It keeps track of line numbers.
5. It reports the errors encountered while generating the tokens.
ISSUES IN LEXICAL ANALYSIS:
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing.
1. Simpler design.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.
TOKEN: It is a sequence of characters that can be treated as a single logical entity. Typical tokens are,
1. Identifiers
2. Keywords
3. Operators
4. Special symbols
5. Constants
PATTERN: There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
LEXEME: It is a sequence of characters in the source program that is matched by the pattern for a token.
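The relation between the three terms can be illustrated by writing each pattern as a regular expression and asking which token a given lexeme belongs to. The keyword set, the pattern table and the function name classify are assumptions chosen for this sketch; a real language would define its own.

```python
import re

KEYWORDS = {"if", "else", "while", "float", "int"}    # assumed sample keyword set

# Each entry pairs a token name with the pattern describing its lexemes.
PATTERNS = [
    ("constant",       r"\d+(\.\d+)?"),
    ("identifier",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",       r"[+\-*/=<>]|:=|\+\+"),
    ("special symbol", r"[;,(){}]"),
]


def classify(lexeme):
    """Return the token whose pattern matches the whole lexeme, or None."""
    if lexeme in KEYWORDS:               # keywords look like identifiers, so check first
        return "keyword"
    for token, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return token
    return None


for lexeme in ["float", "rate", "60", "++", ";", "@"]:
    print(repr(lexeme), "->", classify(lexeme))
```

Many different lexemes (rate, position, a) map to the same token (identifier) because they all match the same pattern, and a string such as @ that matches no pattern is a lexical error.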
INPUT BUFFERING :
During the analysis, the scanner scans the input string from left to right, one character at a time, to identify tokens. It uses two pointers for this analysis:
1. Begin pointer (bp), which keeps track of the first character of the current lexeme.
2. Forward pointer (fp), which keeps track of the next character to be scanned.
Steps in Scanning the Input:
1. Initially, both the begin pointer and the forward pointer point to the first character of the lexeme.

   float a, b; a = A + 2;
   ^
   bp, fp

2. The fp scans the buffer until a match with the pattern of a token is found.
3. Once the end of the lexeme is found (a space or a delimiter is reached), the fp marks the right end of the lexeme.

   float a, b; a = A + 2;
   ^    ^
   bp   fp

4. After processing the lexeme, both pointers are set to point to the character immediately after the lexeme.
5. This procedure is repeated for the entire source program.
Input strings are usually stored in a buffer. Two Types:
1. One buffer scheme
2. Two buffer scheme
One Buffer Scheme: Only one buffer of size N is used. The first N characters of the input string are read into the buffer. When the fp reaches the end of the buffer, the buffer is refilled with the next N characters.
Drawbacks: When the size of a token is greater than N, this scheme fails to produce the token.
Two Buffer Scheme: Two buffers of N characters each are used. They form the first half and the second half of the input buffer.
[Figure: the input buffer split into a first half and a second half of N characters each, holding the input float a, b; a = A + 2; with an eof marker and the pointers bp and fp.]
The first N characters are read into the first half of the buffer. If fewer than N characters are available to fill a half, a special character called EOF is inserted to indicate the end of the input. When the pointer reaches the end of the first half, the second half is loaded with the next N characters of the same program. When the pointer reaches the end of the second half, the first half is loaded with the next N characters of the input.
Algorithm for advancing the fp :
    if fp is at the end of first half then
    begin
        Load second half;
        Increment fp by 1;
    end
    else if fp is at the end of second half then
    begin
        Load first half;
        Set fp to first character of first half;
    end
    else
        Increment fp by 1;
With this algorithm, every advance of the fp has to check whether it has reached the end of a buffer half. To reduce the number of comparisons, a special character called the sentinel character (usually EOF) is placed at the end of each buffer half.
Algorithm for advancing the fp using Sentinel:
    fp = fp + 1;
    if the character at fp = eof then
    begin
        if fp is at the end of first half then
        begin
            Load second half;
            Increment fp by 1;
        end
        else if fp is at the end of second half then
        begin
            Load first half;
            Set fp to first character of first half;
        end
        else
            Terminate lexical analysis;
    end
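The sentinel version of the two buffer scheme can be sketched in Python as follows. The class name, the buffer layout and the choice of "\0" as the EOF sentinel are assumptions made for this illustration; the sketch also reads the character before advancing, a small variation on the pseudocode that keeps the same idea of one comparison per character in the common case.

```python
import io

EOF = "\0"     # sentinel character; assumed never to occur in the source text


class TwoBufferReader:
    """Two halves of n characters, each followed by one sentinel slot."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [0 .. n-1] first half, [n] sentinel,
        #         [n+1 .. 2n] second half, [2n+1] sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.fp = 0

    def _load(self, start):
        """Fill one half from the stream; pad with EOF if fewer than n characters remain."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the character at fp and advance; return None at the end of the input."""
        while True:
            ch = self.buf[self.fp]
            if ch != EOF:                      # the only test in the common case
                self.fp += 1
                return ch
            if self.fp == self.n:              # sentinel ending the first half
                self._load(self.n + 1)
                self.fp = self.n + 1
            elif self.fp == 2 * self.n + 1:    # sentinel ending the second half
                self._load(0)
                self.fp = 0
            else:                              # EOF inside a half: real end of input
                return None


reader = TwoBufferReader(io.StringIO("float a, b; a = A + 2;"), n=4)
chars = []
while (ch := reader.next_char()) is not None:
    chars.append(ch)
print("".join(chars))
```

The half boundaries are only examined after an EOF has been seen, so ordinary characters cost a single comparison each. The sketch does not track bp, so it does not protect a lexeme that is still in use when a half is reloaded.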
Refer the following from Theory of Computation
1. Finite Automata
2. DFA
3. NFA
4. Regular Expression
5. Converting R.E into NFA
6. Converting NFA with ε into NFA and DFA
7. Minimization of DFA.