PL Lec 2 Syntax and Semantics

This document covers the concepts of syntax and semantics in programming, explaining their differences and the compilation process from source code to machine language. It discusses lexical analysis, tokenization, parsing, and error detection in compilers, as well as the role of regular expressions in pattern matching. Additionally, it addresses semantic analysis and common semantic errors encountered during program execution.

Lesson 2 – Describing Syntax and Semantics

Objectives:
• Differentiate syntax and semantics
• Explain the process/phases of compiling source code
• Perform tokenization in Java
Syntax – the form of the expressions, statements, and program units.
Semantics – the meaning of the expressions, statements, and program units.

• Ex: the syntax of a Java while statement is

  while (boolean_expr) statement

• The semantics of this statement form is that when the current value of the Boolean expression is true, the embedded statement is executed.
• The form of a statement should strongly suggest what the statement is meant to accomplish.
• A lexeme is the lowest-level syntactic unit of a language. It includes identifiers, literals, operators, and special words (e.g., *, sum, begin).
• A program is a string of lexemes.
• A token is a category of lexemes (e.g., identifier).
• An identifier is a token that has lexemes, or instances, such as sum and total.
• Ex: index = 2 * count + 17;

  Lexemes   Tokens
  index     identifier
  =         equal_sign
  2         int_literal
  *         mult_op
  count     identifier
  +         plus_op
  17        int_literal
  ;         semicolon
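As a minimal sketch (our addition, not part of the original slides), the following Java program uses java.util.regex to split that statement into exactly the lexeme/token pairs listed above. The token names come from the table; since Java group names cannot contain underscores, int_literal becomes intliteral, and so on:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class Tokenizer {
      public static void main(String[] args) {
          String input = "index = 2 * count + 17;";
          // One alternative per token category, named after the table above.
          Pattern p = Pattern.compile(
              "(?<identifier>[A-Za-z_][A-Za-z0-9_]*)"
              + "|(?<intliteral>[0-9]+)"
              + "|(?<equalsign>=)"
              + "|(?<multop>\\*)"
              + "|(?<plusop>\\+)"
              + "|(?<semicolon>;)");
          Matcher m = p.matcher(input);
          while (m.find()) {  // find() skips the whitespace between lexemes
              for (String name : new String[] {"identifier", "intliteral",
                      "equalsign", "multop", "plusop", "semicolon"}) {
                  if (m.group(name) != null) {
                      System.out.println(m.group(name) + "\t" + name);
                  }
              }
          }
      }
  }

Running it prints one lexeme/token pair per line, reproducing the table.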
What Happens When You Run a Program

• A computer doesn’t actually understand the phrase ‘Hello, world!’, and it doesn’t know how to display it on screen. It only understands on and off. So to actually run a command like print 'Hello, world!', it has to translate all the code in a program into a series of ons and offs that it can understand.

• To do that, a number of things happen:
  1. The source code is translated into assembly language.
  2. The assembly code is translated into machine language.
  3. The machine language is directly executed as binary code.

• The language implementation first has to translate the source code into assembly language, a very low-level language that uses words and numbers to represent binary patterns. Depending on the language, this may be done with an interpreter (where the program is translated line by line) or with a compiler (where the program is translated as a whole).
• The assembly code is then sent to the computer’s assembler, which converts it into the machine language that the computer can understand and execute directly as binary code.
Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks these syntaxes into a series of tokens, removing any whitespace or comments in the source code.

If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when it demands it.
Tokens

Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
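For instance (a short sketch of our own, not from the slides), the pattern for an identifier token can be written as a regular expression in Java:

  import java.util.regex.Pattern;

  public class IdentifierPattern {
      public static void main(String[] args) {
          // An identifier: a letter or underscore, followed by letters, digits, or underscores.
          Pattern identifier = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
          System.out.println(identifier.matcher("sum").matches());    // true
          System.out.println(identifier.matcher("2count").matches()); // false: starts with a digit
      }
  }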
What is a regular expression?

A regular expression is a sequence of characters that forms a search pattern. When you search for data in a text, you can use this search pattern to describe what you are searching for.

A regular expression can be a single character or a more complicated pattern. Regular expressions can be used to perform all types of text search and text replace operations.

Java does not have a built-in Regular Expression class, but we can import the java.util.regex package to work with regular expressions. The package includes the following classes:

•Pattern Class - Defines a pattern (to be used in a search)


•Matcher Class - Used to search for the pattern
•PatternSyntaxException Class - Indicates syntax error in a
regular expression pattern
Find out if there are any occurrences of the word
"w3schools" in a sentence:
Flags

Flags in the compile() method change how the search is performed. Here are a few of them:

• Pattern.CASE_INSENSITIVE - The case of letters will be ignored when performing a search.
• Pattern.LITERAL - Special characters in the pattern will not have any special meaning and will be treated as ordinary characters when performing a search.
• Pattern.UNICODE_CASE - Use it together with the CASE_INSENSITIVE flag to also ignore the case of letters outside of the English alphabet.
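The "w3schools" example itself is missing from this copy of the slides; the following is a reconstruction (our sketch, matching the standard java.util.regex usage that the next section explains step by step):

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class Main {
      public static void main(String[] args) {
          // Compile the pattern; the flag makes the search case-insensitive.
          Pattern pattern = Pattern.compile("w3schools", Pattern.CASE_INSENSITIVE);
          Matcher matcher = pattern.matcher("Visit W3Schools!");
          boolean matchFound = matcher.find();
          if (matchFound) {
              System.out.println("Match found");
          } else {
              System.out.println("Match not found");
          }
      }
  }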
Example Explained

In this example, the word "w3schools" is searched for in a sentence.

First, the pattern is created using the Pattern.compile() method. The first parameter indicates which pattern is being searched for, and the second parameter has a flag indicating that the search should be case-insensitive. The second parameter is optional.

The matcher() method is used to search for the pattern in a string. It returns a Matcher object which contains information about the search that was performed.

The find() method returns true if the pattern was found in the string and false if it was not found.
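As a further sketch (our addition), flags can be combined with a bitwise OR; the non-English word "äpfel" is our own choice to show the effect of UNICODE_CASE:

  import java.util.regex.Pattern;

  public class FlagsExample {
      public static void main(String[] args) {
          // Default case folding covers only US-ASCII letters, so ä does not match Ä.
          System.out.println(Pattern.compile("äpfel", Pattern.CASE_INSENSITIVE)
                  .matcher("ÄPFEL").find());                          // false
          // Adding UNICODE_CASE also ignores case for letters outside the English alphabet.
          System.out.println(Pattern.compile("äpfel",
                  Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                  .matcher("ÄPFEL").find());                          // true
      }
  }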
Regular Expression Patterns

The first parameter of the Pattern.compile() method is the pattern. It describes what is being searched for.

Brackets are used to find a range of characters: for example, [abc] finds one character from the options between the brackets, [^abc] finds one character NOT between the brackets, and [0-9] finds one character from the range 0 to 9.
Metacharacters

Metacharacters are characters with a special meaning: for example, | finds a match for any one of the patterns it separates, . matches any single character, ^ matches the beginning of a string, $ matches the end of a string, and \d finds a digit.
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens. For example, in the C language, the variable declaration line

  int value = 100;

contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant), and ; (symbol).
Specifications of Tokens

Let us understand how language theory defines the following terms:

Alphabets
Any finite set of symbols is an alphabet: {0,1} is the set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets, and {a-z, A-Z} is the set of English-language alphabets.

Strings
Any finite sequence of alphabets is called a string. The length of a string is the total number of occurrences of alphabets in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string having no alphabets, i.e., a string of zero length, is known as an empty string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains the following symbols:

  Arithmetic Symbols    Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
  Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
  Assignment            =
  Special Assignment    +=, /=, *=, -=
  Comparison            ==, !=, <, <=, >, >=
  Preprocessor          #
  Location Specifier    &
  Logical               &, &&, |, ||, !
  Shift Operator        >>, >>>, <<, <<<
Longest Match Rule

• When the lexical analyzer reads the source code, it scans the code letter by letter; when it encounters a whitespace, an operator symbol, or a special symbol, it decides that a word is completed.
• For example: int intvalue;
• While scanning up to ‘int’, the lexical analyzer cannot determine whether it is the keyword int or the initials of the identifier intvalue.
• The Longest Match Rule states that the lexeme scanned should be determined based on the longest match among all the tokens available.
• The lexical analyzer also follows rule priority, where a reserved word, e.g., a keyword, of a language is given priority over user input. That is, if the lexical analyzer finds a lexeme that exactly matches an existing reserved word, it classifies it as that reserved word rather than as an identifier.
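A rough sketch (our illustration, not from the slides) of both rules in Java: greedy regex matching yields the longest lexeme, and a keyword table applies rule priority afterwards:

  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class LongestMatch {
      static final Set<String> KEYWORDS = Set.of("int", "if", "while");
      static final Pattern WORD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

      static String classify(String input) {
          Matcher m = WORD.matcher(input);
          if (m.lookingAt()) {                 // greedy: longest match from position 0
              String lexeme = m.group();
              // Rule priority: an exact reserved-word match wins over identifier.
              return lexeme + " -> " + (KEYWORDS.contains(lexeme) ? "keyword" : "identifier");
          }
          return input + " -> no match";
      }

      public static void main(String[] args) {
          System.out.println(classify("int"));      // int -> keyword
          System.out.println(classify("intvalue")); // intvalue -> identifier (not 'int' + 'value')
      }
  }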
Syntax Analysis / Parsing

Syntax analysis or parsing is the second phase of a compiler. In this section, we shall learn the basic concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the limitations of regular expressions. Regular expressions cannot check balancing tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar. This implies that every Regular Grammar is also context-free, but there exist some problems that are beyond the scope of Regular Grammar. CFG is a helpful tool for describing the syntax of programming languages.
Context-Free Grammar

We will first see the definition of context-free grammar and introduce the terminology used in parsing technology. A context-free grammar has four components:

• A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
• A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
• A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
• One of the non-terminals is designated as the start symbol (S), from where the production begins.

Strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.
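For example, the four components of the expression grammar used in the derivation examples later in this lesson can be written as:

  Non-terminals (V): { E }
  Terminals (Σ):     { id, +, * }
  Productions (P):   E -> E + E
                     E -> E * E
                     E -> id
  Start symbol (S):  E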
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.

This way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a parse tree as the output of the phase.
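As a sketch (our addition, using an unambiguous variant of the expression grammar rather than the ambiguous E -> E + E form shown in the next topic), a tiny recursive-descent parser that consumes a token stream and reports syntax errors:

  // Grammar: E -> T { + T },  T -> F { * F },  F -> id
  public class MiniParser {
      private final String[] tokens;
      private int pos = 0;

      MiniParser(String[] tokens) { this.tokens = tokens; }

      private String peek() { return pos < tokens.length ? tokens[pos] : "<eof>"; }

      private void expect(String t) {
          if (!peek().equals(t))
              throw new RuntimeException("syntax error: expected " + t + ", found " + peek());
          pos++;
      }

      void parseE() { parseT(); while (peek().equals("+")) { pos++; parseT(); } }
      void parseT() { parseF(); while (peek().equals("*")) { pos++; parseF(); } }
      void parseF() { expect("id"); }

      public static void main(String[] args) {
          // Parses the token stream for: id + id * id
          new MiniParser(new String[] {"id", "+", "id", "*", "id"}).parseE();
          System.out.println("parsed successfully");
      }
  }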
Derivation

A derivation is basically a sequence of production rules applied in order to get the input string. During parsing, we take two decisions for some sentential form of the input:
• Deciding which non-terminal is to be replaced.
• Deciding the production rule by which the non-terminal will be replaced.

To decide which non-terminal to replace with a production rule, we have two options.

Left-most Derivation
• If the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation
• If we scan and replace the input with production rules from right to left, it is known as right-most derivation. The sentential form derived from the right-most derivation is called the right-sentential form.
Input string: id + id * id
Note: Follow MDAS (multiplication and division before addition and subtraction).

• Input = id + id * id
• Grammar (N, T, P, S):
  E -> E + E
  E -> E * E
  E -> id

• Left-Most Derivation:
  E -> E + E
    -> id + E
    -> id + E * E
    -> id + id * E
    -> id + id * id

• Right-Most Derivation:
  E -> E + E
    -> E + E * E
    -> E + E * id
    -> E + id * id
    -> id + id * id

Exercises:
• Input 1: id + id * id / id (give the Left-Most Derivation)
• Input 2: id * (id + id) - id (give the Right-Most Derivation)
Parse Tree

A parse tree is a graphical depiction of a derivation. It is a convenient way to see how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this with an example from the last topic.

We take the left-most derivation of a + b * c.
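The tree itself is a figure in the original slides and is not reproduced in this copy; the following is our sketch of it. Using the grammar E -> E + E | E * E | id, with a, b, and c as identifiers, the left-most derivation is E => E + E => a + E => a + E * E => a + b * E => a + b * c, giving the tree:

            E
          / | \
         E  +  E
         |   / | \
         a  E  *  E
            |     |
            b     c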
Error Detection

• A parser should be able to detect and report any error in the program. It is expected that when an error is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The parser is mostly expected to check for errors, but errors may be encountered at various stages of the compilation process. A program may have the following kinds of errors at various stages:
• Lexical: the name of some identifier typed incorrectly
• Syntactical: a missing semicolon or unbalanced parentheses
• Semantical: an incompatible value assignment
• Logical: unreachable code, an infinite loop
• There are four common error-recovery strategies that can be implemented in the parser to deal with errors in the code.
Panic mode
When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by not processing input from the erroneous token up to a delimiter, such as a semicolon. This is the easiest way of error recovery, and it also prevents the parser from falling into infinite loops.
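Extending the MiniParser sketch from the Syntax Analyzers section (our illustration), panic-mode recovery can be as simple as skipping tokens until the delimiter:

  // On a syntax error, discard input up to the next ';' and resume parsing there.
  private void recover() {
      while (!peek().equals(";") && !peek().equals("<eof>")) {
          pos++;                      // ignore the erroneous input
      }
      if (peek().equals(";")) pos++;  // consume the delimiter and carry on
  }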
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the statement's input allows the parser to parse ahead; for example, inserting a missing semicolon, or replacing a comma with a semicolon. Parser designers have to be careful here, because one wrong correction may lead to an infinite loop.
Error productions
Some common errors that may occur in the code are known to the compiler designers. The designers can create an augmented grammar to be used, with productions that generate erroneous constructs when these errors are encountered.
Global correction
The parser considers the program in hand as a whole, tries to figure out what the program is intended to do, and tries to find the closest match for it that is error-free. When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may allow the parser to make minimal changes in the source code, but due to the complexity (time and space) of this strategy, it has not been implemented in practice yet.
Semantic Analysis

The semantics of a language provide meaning to its constructs, like tokens and syntax structure. Semantics help interpret symbols, their types, and their relations with each other. Semantic analysis judges whether the syntax structure constructed in the source program derives any meaning or not.
Semantic Errors
Some of the semantic errors that the semantic analyzer is expected to recognize:
• Type mismatch
• Undeclared variable
• Reserved identifier misuse
• Multiple declaration of a variable in a scope
• Accessing an out-of-scope variable
• Actual and formal parameter mismatch
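A few Java lines (our illustration) whose form is syntactically legal but which contain the errors above; each would be rejected by the compiler's semantic checks:

  int count = "seven";   // type mismatch: a String assigned to an int
  total = 10;            // undeclared variable: total was never declared
  int while = 3;         // reserved identifier misuse: 'while' is a keyword
  int count = 0;         // multiple declaration of count in the same scope
  Math.max(1);           // actual/formal parameter mismatch: max needs two arguments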
By runtime, we mean a program in execution. The runtime environment is a state of the target machine, which may include software libraries, environment variables, etc., that provides services to the processes running in the system.

The runtime support system is a package, mostly generated with the executable program itself, that facilitates communication between the process and the runtime environment. It takes care of memory allocation and de-allocation while the program is being executed.
