CD Mini Project Lexical Analyzer
Guided by:
Dr. Kowsigan
Submitted By:
IKLASH KHAN (RA2011003011391)
SWATI ANAND (RA2011003011383)
SIDDHARTH SINGH (RA2011003011395)
AIM: LEXICAL ANALYZER FOR C LANGUAGE
ABSTRACT:-
A compiler is a special program that processes statements written in
a particular programming language and turns them into machine language
or code that a computer's processors use. The file used for writing a C-
language program contains what are called the source statements. The programmer
then runs the appropriate language compiler, specifying the name of the
file that contains the source statements. When executing, the compiler first
parses all of the language statements syntactically one after the other and
then, in one or more successive stages, builds the output code, making sure
that statements that refer to other statements are referred to correctly in
the final code. The output of the compilation is called object code or
sometimes an object module. Lexical analysis is the first phase of a
compiler. It takes the modified source code from language preprocessors
that are written in the form of sentences. The lexical analyzer breaks these
sentences into a series of tokens, removing any whitespace or comments in
the source code. The symbol table is an important data structure created and
maintained by compilers in order to store information about the
occurrence of various entities such as variable names, function names, etc.
The symbol table is used by both the analysis and the synthesis parts of a
compiler. We have designed a lexical analyzer for the C language using lex.
It takes as input a C code and outputs a stream of tokens. The tokens
displayed as part of the output include keywords, identifiers,
signed/unsigned integer/floating point constants, operators, special
characters, headers, data-type specifiers, array, single-line comment, multi-
line comment, preprocessor directive, pre-defined functions (printf and
scanf), user-defined functions and the main function. The token, the type
of token and the line number of the token in the C code are displayed.
The line number is displayed so that it is easier to debug the code for
errors. Errors in single-line and multi-line comments are displayed
along with line numbers. The output also contains the symbol table which
contains tokens and their type. The symbol table is generated using the hash
organisation.
REQUIREMENTS TO RUN THE SCRIPT:
➢ lex (or flex), to generate the scanner from the specification
➢ A C compiler such as gcc, to compile the generated lex.yy.c
COMPILER DESIGN PHASES
A compiler is a special program that processes statements written in a
particular programming language and turns them into machine
language or code that a computer's processors use. Based on the way
they compile, compilers can broadly be divided into two parts, analysis
and synthesis, which together comprise the following phases:
➢ Lexical Analysis
➢ Syntax Analysis
➢ Semantic Analysis
➢ Intermediate Code Generation
➢ Code Optimization
➢ Code Generation
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source
code from language preprocessors that are written in the form of sentences.
The lexical analyzer breaks these sentences into a series of tokens,
removing any whitespace or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The
lexical analyzer works closely with the syntax analyzer. It reads character
streams from the source code, checks for legal tokens, and passes the data
to the syntax analyzer on demand.
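For example, given the statement below (an illustrative fragment, not part of the project's test input), the lexical analyzer produces five tokens: int (keyword), count (identifier), = (operator), 10 (integer constant) and ; (special character).

int count = 10;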
Syntax Analysis
Syntax analysis or parsing is the second phase of a compiler. It takes the
tokens produced by lexical analysis as input and generates a parse tree (or
syntax tree).
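For example, for the assignment a = b + c; the parser builds a tree in which the assignment is the root and the addition is a subtree (an illustrative sketch):

        =
       / \
      a   +
         / \
        b   c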
Semantic Analysis
Semantic analysis is the third phase of a compiler. The semantic
analyzer checks whether the parse tree constructed by the syntax
analyzer follows the rules of the language.
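For example, the following declaration is lexically and syntactically well formed, yet the semantic analyzer reports it because the types do not agree (an illustrative C fragment):

int x = "hello";   /* semantic error: a string constant initialising an int */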
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate
representation of the source program for an abstract machine, which the
later phases refine and translate.
Code Optimization
In this phase, the intermediate code is optimized. Optimization can be
seen as removing unnecessary code lines and arranging the sequence of
statements in order to speed up the program execution without wasting
resources (CPU, memory).
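For instance, an optimizer may evaluate constant expressions at compile time and remove code that can never execute (an illustrative C fragment):

x = 4 * 5;          /* before optimization */
if (0) { y = 1; }   /* dead code: the condition is always false */

x = 20;             /* after optimization: folded constant, dead branch removed */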
Code Generation
In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language.
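For example, a simple assignment might be mapped to register-level instructions of the following kind (an illustrative sketch; the actual instructions depend on the target machine):

a = b + c;     /* source statement */
/* possible target code:
   LOAD  R1, b
   ADD   R1, c
   STORE a, R1 */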
THE LEXICAL ANALYSIS:
In computer science, lexical analysis is the process of converting a sequence
of characters (such as in a computer program or web page) into a sequence
of tokens (strings with an identified "meaning"). A program that performs
lexical analysis may be called a lexer, tokenizer, or scanner (though
"scanner" is also used to refer to the first stage of a lexer). Such a lexer is
generally combined with a parser, which together analyze the syntax of
programming languages, web pages, and so forth.
The script written by us is input to a computer program called "lex",
which generates lexical analyzers ("scanners" or "lexers"). Lex reads an
input stream specifying the lexical analyzer and outputs source code
implementing the lexer in the C programming language.
The structure of the lex program consists of three sections:
{definition section}
%%
{rules section}
%%
{C code section}
The definition section defines macros and imports header files written in C.
It is also possible to write any C code here, which will be copied verbatim
into the generated source file.
The rules section associates regular expression patterns with C statements.
When the lexer sees text in the input matching a given pattern, it will
execute the associated C code.
The C code section contains C statements and functions that are copied
verbatim to the generated source file. These statements presumably contain
code called by the rules in the rules section. In large programs it is more
convenient to place this code in a separate file linked in at compile time.
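The following stand-alone lex program is a minimal sketch of this three-section layout (a hypothetical example for illustration, separate from the project's lexer); it recognises integers and counts input lines:

%{
#include <stdio.h>
int lines = 0;    /* defined here, visible to the rules and C code sections */
%}
digit [0-9]
%%
{digit}+   { printf("INTEGER: %s\n", yytext); }
\n         { lines++; }
.          { /* ignore every other character */ }
%%
int main()
{
    yylex();                            /* run the generated scanner */
    printf("lines read: %d\n", lines);
    return 0;
}
int yywrap() { return 1; }              /* no further input files */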
The lex program, when compiled using the lex command, generates a file
called lex.yy.c, which when executed recognizes the tokens present in the
input C program.
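Assuming the specification is saved in a file named lexer.l (the file name is only illustrative), the analyzer can be built and run as follows:

lex lexer.l            # generates lex.yy.c (flex lexer.l also works)
cc lex.yy.c -o lexer   # compile the generated scanner
./lexer < input.c      # run the lexer on a C source file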
Lexical analysis only takes care of recognising the tokens and identifying
their type. The output of this phase is the stream of tokens as well as the
symbol table representing the tokens and their types.
Code:
%{
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
int lineno = 1;
#define AUTO 1
#define BREAK 2
#define CASE 3
#define CHAR 4
#define CONST 5
#define CONTINUE 6
#define DEFAULT 7
#define DO 8
#define DOUBLE 9
#define ELSE 10
#define ENUM 11
#define EXTERN 12
#define FLOAT 13
#define FOR 14
#define GOTO 15
#define IF 16
#define INT 17
#define LONG 18
#define REGISTER 19
#define RETURN 20
#define SHORT 21
#define SIGNED 22
#define SIZEOF 23
#define STATIC 24
#define STRUCT 25
#define SWITCH 26
#define TYPEDEF 27
#define UNION 28
#define UNSIGNED 29
#define VOID 30
#define VOLATILE 31
#define WHILE 32
#define IDENTIFIER 33
#define SLC 34
#define MLCS 35
#define MLCE 36
#define LEQ 37
#define GEQ 38
#define EQEQ 39
#define NEQ 40
#define LOR 41
#define LAND 42
#define ASSIGN 43
#define PLUS 44
#define SUB 45
#define MULT 46
#define DIV 47
#define MOD 48
#define LESSER 49
#define GREATER 50
#define INCR 51
#define DECR 52
#define COMMA 53
#define SEMI 54
#define HEADER 55
#define MAIN 56
#define PRINTF 57
#define SCANF 58
#define DEFINE 59
#define INT_CONST 60
#define FLOAT_CONST 61
#define TYPE_SPEC 62
#define DQ 63
#define OBO 64
#define OBC 65
#define CBO 66
#define CBC 67
#define HASH 68
#define ARR 69
#define FUNC 70
#define NUM_ERR 71
#define UNKNOWN 72
#define CHAR_CONST 73
#define STRING_CONST 75
%}
alpha [A-Za-z]
digit [0-9]
und [_]
space [ ]
tab [\t]
line [\n]
char \'.\'
at [@]
string \"(.^([%d]|[%f]|[%s]|[%c]))\"
%%
{space}* {}
{tab}* {}
{line} {lineno++;}
auto return AUTO;
break return BREAK;
case return CASE;
char return CHAR;
const return CONST;
continue return CONTINUE;
default return DEFAULT;
do return DO;
double return DOUBLE;
else return ELSE;
enum return ENUM;
extern return EXTERN;
float return FLOAT;
for return FOR;
goto return GOTO;
if return IF;
int return INT;
long return LONG;
register return REGISTER;
return return RETURN;
short return SHORT;
signed return SIGNED;
sizeof return SIZEOF;
static return STATIC;
struct return STRUCT;
switch return SWITCH;
typedef return TYPEDEF;
union return UNION;
unsigned return UNSIGNED;
void return VOID;
volatile return VOLATILE;
while return WHILE;
printf return PRINTF;
scanf return SCANF;
"//".* return SLC;
"/*" return MLCS;
"*/" return MLCE;
"<=" return LEQ;
">=" return GEQ;
"==" return EQEQ;
"!=" return NEQ;
"||" return LOR;
"&&" return LAND;
"=" return ASSIGN;
"+" return PLUS;
"-" return SUB;
"*" return MULT;
"/" return DIV;
"%" return MOD;
"<" return LESSER;
">" return GREATER;
"++" return INCR;
"--" return DECR;
"," return COMMA;
";" return SEMI;
"#include<stdio.h>" return HEADER;
"#include <stdio.h>" return HEADER;
"main()" return MAIN;
"%d"|"%f"|"%u"|"%s" return TYPE_SPEC;
"\"" return DQ;
"(" return OBO;
")" return OBC;
"{" return CBO;
"}" return CBC;
"#" return HASH;
{alpha}({alpha}|{digit}|{und})*\[{digit}*\] return ARR;
{alpha}({alpha}|{digit}|{und})* return IDENTIFIER;
[+-]?{digit}+ return INT_CONST;
[+-]?{digit}+\.{digit}+ return FLOAT_CONST;
{char} return CHAR_CONST;
{string} return STRING_CONST;
. return UNKNOWN;
%%
/* Node of a hash-table chain: one token and its type */
struct node
{
    char token[100];
    char attr[100];
    struct node *next;
};

/* One bucket of the symbol table: head of its chain and node count */
struct hash
{
    struct node *head;
    int count;
};

struct hash hashTable[1000];
int eleCount = 1000;

/* Hash function: sum of the character codes modulo the table size */
int hashIndex(char token[])
{
    int hi = 0;
    int i;
    for(i = 0; token[i] != '\0'; i++)
    {
        hi = hi + (int)token[i];
    }
    hi = hi % eleCount;
    return hi;
}

/* Allocate and initialise a node for a token and its type */
struct node* createNode(char token[], char attr[])
{
    struct node *newnode = (struct node*)malloc(sizeof(struct node));
    strcpy(newnode->token, token);
    strcpy(newnode->attr, attr);
    newnode->next = NULL;
    return newnode;
}

/* Insert a token into the symbol table unless it is already present */
void insertToHash(char token[], char attr[])
{
    int flag = 0;
    int hi = hashIndex(token);
    struct node *newnode = createNode(token, attr);
    /* head of list for the bucket with index "hi" */
    struct node *myNode;

    if (hashTable[hi].head == NULL)
    {
        hashTable[hi].head = newnode;
        hashTable[hi].count = 1;
        return;
    }

    myNode = hashTable[hi].head;
    while (myNode != NULL)
    {
        if (strcmp(myNode->token, token) == 0)
        {
            flag = 1;
            break;
        }
        myNode = myNode->next;
    }

    if(!flag)
    {
        /* adding new node at the head of the list and updating
           the number of nodes in the current bucket */
        newnode->next = hashTable[hi].head;
        hashTable[hi].head = newnode;
        hashTable[hi].count++;
    }
    return;
}

/* Print the symbol table: serial number, token and token type */
void display()
{
    struct node *myNode;
    int i, k = 1;
    printf(" -------------------------------------------------------");
    printf("\nSNo \t|\tToken \t\t|\tToken Type \t\n");
    printf("------------------------------------------------------- \n");
    for (i = 0; i < eleCount; i++)
    {
        if (hashTable[i].count == 0)
            continue;
        myNode = hashTable[i].head;
        if (!myNode)
            continue;
        while (myNode != NULL)
        {
            printf("%d\t\t", k++);
            printf("%s\t\t\t", myNode->token);
            printf("%s\t\n", myNode->attr);
            myNode = myNode->next;
        }
    }
    return;
}

int main()
{
    int scan, mlc = 0, dq = 0;
    int slcline = 0, mlcline = 0, dqline = 0;
    scan = yylex();
    while(scan)
    {
        /* skip the rest of a line that carries a single-line comment */
        if(lineno == slcline)
        {
            scan = yylex();
            continue;
        }
        /* an odd number of quotes on a line means an unterminated string */
        if(dq%2 != 0)
        {
            printf("\n******** ERROR!! INCOMPLETE STRING at Line %d ********\n\n", dqline);
            dq = 0;
        }
        /* keywords: token codes 1 to 32 */
        if((scan >= 1 && scan <= 32) && mlc == 0)
        {
            printf("%s\t\t\tKEYWORD\t\t\t\tLine %d\n", yytext, lineno);
            insertToHash(yytext, "KEYWORD");
        }
        if(scan == 33 && mlc == 0)
        {
            printf("%s\t\t\tIDENTIFIER\t\t\tLine %d\n", yytext, lineno);
            insertToHash(yytext, "IDENTIFIER");
        }
        /* single-line comment: remember its line so the rest is skipped */
        if(scan == 34)
        {
            slcline = lineno;
        }
        /* multi-line comment start: suppress token output until it ends */
        if(scan == 35)
        {
            mlc = 1;
            mlcline = lineno;
            printf("%s\t\t\tMultiline Comment Start\t\tLine %d\n", yytext, lineno);
        }
        if(scan == 36)
        {
            mlc = 0;
            printf("%s\t\t\tMultiline Comment End\t\tLine %d\n", yytext, lineno);
        }
        /* operators and punctuation: token codes 37 to 54 */
        if((scan >= 37 && scan <= 54) && mlc == 0)
        {
            printf("%s\t\t\tOPERATOR\t\t\tLine %d\n", yytext, lineno);
            insertToHash(yytext, "OPERATOR");
        }
        if(scan == 55 && mlc == 0)
        {
            printf("%s\tHEADER\t\t\t\tLine %d\n", yytext, lineno);
        }
        /* double quote: count quotes to detect unterminated strings */
        if(scan == 63 && mlc == 0)
        {
            dq++;
            dqline = lineno;
        }
        if(scan == 69 && mlc == 0)
        {
            printf("%s\t\t\tARRAY\t\t\t\tLine %d\n", yytext, lineno);
            insertToHash(yytext, "ARRAY");
        }
        /* string constants are consumed without being reported */
        if(scan == 75 && mlc == 0)
        {
            scan = yylex();
        }
        scan = yylex();
    }
    if(mlc == 1)
        printf("\n******** ERROR!! UNMATCHED COMMENT STARTING at Line %d ********\n\n", mlcline);
    printf("\n");
    printf("\n\t******** SYMBOL TABLE ********\t\t\n");
    display();
    printf("------------------------------------------------------- \n\n");
    return 0;
}

int yywrap()
{
    return 1;
}
Output:
Input file (isPrime.c)
#include<stdio.h>
int main()
{
    int a,i,flag=0;
    printf("Input no");
    scanf("%d",&a);
    i=2;
    while(i <= a/2)
    {
        if(a%i == 0)
        {
            flag=1;
            break;
        }
        i++;
    }
    if(flag==0)
        printf("%d Prime", a);
    return 0;
}
OUTPUT:
SYMBOL TABLE:
RESULT:
The task of the lexical analyzer is to read characters one by one from the
program and analyse the character stream to distinguish the words in the
program. A word here refers to a set of characters that has a compact
logical relationship and a collective meaning, called a token; this process
is called tokenization. To adapt to multi-core environments, the
tokenization process can be parallelised, which is achieved by exploiting
the parallel constructs of the language.