M. Tayyab
The principles of programming languages (PPL) guide the design, implementation, and use of software systems.
These principles ensure that a language is logical, efficient, and suitable
for its intended purpose.
Here are some key principles:
Abstraction: Simplifies complex concepts by allowing developers to
work at a higher level of logic. Examples: functions, classes, and
modules in programming languages.
Encapsulation: Groups related data and functions together.
Syntax: The grammar or rules that define the structure of programs.
Semantics: The meaning or behavior of the constructs in a language.
Paradigm Support: Programming languages often support one or
more paradigms, such as: Procedural (C), OOP (Java), Functional
(Haskell), Logic-Based (Prolog).
Readability: Encourages clear and understandable code through good
syntax design and conventions.
Reusability: Promotes the use of code across multiple projects or
parts of a program.
Performance: Balances execution speed, memory usage, and other
system resources for efficiency.
Portability: Code can run across platforms with minimal changes.
Security: Protects against unauthorized access or data corruption
through features like access control and memory safety.
Concurrency: Supports the development of programs that can
execute multiple tasks simultaneously.
Algorithmic Thinking: The ability to solve problems by defining a
sequence of clear, unambiguous steps or instructions.
By adhering to these principles, programming languages evolve to be
robust, user-friendly, and efficient.
The lexical structure of a programming language outlines its basic code elements: vocabulary
and symbols. It forms the foundation for syntax and defines how source
code characters are grouped into meaningful sequences, called tokens.
Key components of the lexical structure include:
Characters: Set of valid characters, such as letters, digits, and
symbols.
Tokens: The smallest units of meaning, like if (a keyword).
Lexemes: These are the actual sequences of characters in the code
that form tokens. For instance, in the statement int age = 25;, the
lexemes are int, age, =, 25, and ;.
Whitespace: Spaces, tabs, and newlines, which often separate tokens
and structure code (e.g., Python uses indentation).
Comments: Non-executable text used for documentation.
Escape Sequences: Special sequences in strings or character literals, e.g., \n for a newline.
The lexical analyzer is responsible for scanning the source code and
converting it into a stream of tokens. This is the first phase of
compilation.
Key Functions:
1. Tokenization: Breaks down the input program into meaningful units
called tokens (e.g., keywords, identifiers, literals, operators).
2. Eliminates Noise: Removes irrelevant components like whitespace,
comments, and line breaks.
3. Error Detection: Detects lexical errors such as invalid symbols or
malformed tokens.
2.1 Tokenization and Regular Expressions
Tokenization (also called lexical analysis) is the process of breaking
source code into tokens, the smallest meaningful units in a
programming language.
A lexer (tokenizer) scans the source code character by character and
matches sequences of characters against predefined regular expressions
to classify them into token types:
Keywords (if, while, return)
Identifiers (x, sum, myVariable)
Operators (+, -, *, /, =)
Numbers (42, 3.14, 1000)
Punctuation (;, {, })
Keywords are matched with a regex like \b(if|while|return)\b.
Identifiers are matched with something like [a-zA-Z_]\w*.
Numeric literals can be captured using a pattern like \d+(\.\d+)?.
Operators (+, -, *, etc.) might have their own regex pattern, such as
(\+|\-|\*|\/).
Tokenization Process
Step 1: Input Source Code
Example: int sum = a + 42;
Step 2: Apply Regular Expressions: The lexer scans from left to right
and applies regex patterns to recognize tokens.
Step 3: Generate Tokens: The lexer produces a sequence of tokens:
◦ (KEYWORD, "int"), (IDENTIFIER, "sum"), (OPERATOR, "="),
(IDENTIFIER, "a"), (OPERATOR, "+"), (NUMBER, "42"),
(PUNCTUATION, ";")
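As a rough illustration (not part of the original notes), these three steps can be sketched in Python with the re module; the tokenize function, the token names, and the keyword list below are illustrative assumptions.

import re

# Token patterns in the spirit of the regexes above, tried in order
# (keywords before identifiers so that "int" is not classified as an identifier).
TOKEN_SPEC = [
    ("KEYWORD",     r"\b(?:if|while|return|int|float|char)\b"),
    ("IDENTIFIER",  r"[a-zA-Z_]\w*"),
    ("NUMBER",      r"\d+(?:\.\d+)?"),
    ("OPERATOR",    r"[+\-*/=]"),
    ("PUNCTUATION", r"[;{}()]"),
    ("WHITESPACE",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Scan left to right and classify each lexeme by the first pattern it matches."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind != "WHITESPACE":          # eliminate noise (whitespace is skipped)
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("int sum = a + 42;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'sum'), ('OPERATOR', '='),
#  ('IDENTIFIER', 'a'), ('OPERATOR', '+'), ('NUMBER', '42'), ('PUNCTUATION', ';')]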
Advantages of Using Regular Expressions for Tokenization
Efficiency: Regex-based tokenization is fast and efficient.
Flexibility: Regular expressions allow defining various token types
easily.
Simplicity: The lexer remains simple compared to writing manual
character-by-character parsing.
Limitations:
Cannot handle nested structures – Regex cannot parse balanced
parentheses like (a + (b - c)).
Cannot enforce syntax rules – Regular expressions only classify
tokens but do not check grammar (e.g., if x = 5 is invalid in most
languages, but regex won't detect it).
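As a small illustration of the first limitation (an illustrative sketch, not from the original notes): a single regular expression can only describe a fixed nesting depth, whereas the counting a parser performs handles arbitrary nesting. The names ONE_LEVEL and balanced are assumptions of this sketch.

import re

# This regex only recognizes a single level of parentheses; deeper nesting defeats it.
ONE_LEVEL = re.compile(r"\([^()]*\)")

def balanced(expr):
    """Counter-based check: the sort of bookkeeping a parser, not a lexer, performs."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(bool(ONE_LEVEL.fullmatch("(a + (b - c))")))  # False: the regex cannot match the nesting
print(balanced("(a + (b - c))"))                   # True: counting handles it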
A deterministic finite automaton (DFA) is defined as a 5-tuple:
M = (Q, Σ, δ, q0, F)
Q is a finite set of states.
Σ is a finite set of symbols called the alphabet.
δ is the transition function, where δ: Q × Σ → Q.
q0 is the initial state from which any input is processed (q0 ∈ Q).
F is the set of final (accepting) states (F ⊆ Q).
Let a DFA be given by Q = {a, b, c}, Σ = {0, 1}, q0 = a, F = {c}, with the transition function:

Present State | Next State for Input 0 | Next State for Input 1
a             | a                      | b
b             | c                      | a
c             | b                      | c
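A minimal sketch of running this DFA in Python (the function name run_dfa and the dictionary encoding of δ are illustrative choices, not part of the original notes):

# The transition table above, written as a nested dictionary.
DELTA = {
    "a": {"0": "a", "1": "b"},
    "b": {"0": "c", "1": "a"},
    "c": {"0": "b", "1": "c"},
}
START, ACCEPTING = "a", {"c"}

def run_dfa(input_string):
    """Process the input one symbol at a time with δ; accept if the final state is in F."""
    state = START
    for symbol in input_string:
        state = DELTA[state][symbol]
    return state in ACCEPTING

print(run_dfa("10"))  # a -1-> b -0-> c; c is in F, so the string is accepted (True)
print(run_dfa("11"))  # a -1-> b -1-> a; a is not in F, so the string is rejected (False)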
Suppose we need a lexical analyzer that recognizes only arithmetic
expressions, including variable names and integer literals as operands.
Instead of creating a complex diagram with transitions for every
character, simplifications are made using character classes:
Recognizing Variables: Variable names consist of letters (uppercase
and lowercase) and digits but must start with a letter. Instead of 52
transitions for each possible starting letter, a single LETTER class
represents all letters.
Recognizing Integer Literals: Integer literals consist of digits (0–9). A
DIGIT class groups all digits, simplifying the state diagram to use one
transition for all digits.
Combining LETTER and DIGIT: Since variable names may include
digits after the first letter, transitions can use both the LETTER and DIGIT
classes to handle names efficiently, as illustrated in the sketch after the list of utility subprograms below.
Utility subprograms help manage the lexical analyzer:
getChar: Retrieves the next input character, determines its class, and
stores it in a global variable.
addChar: Adds characters to the current token (ignoring spaces or
unnecessary characters).
getNonBlank: Skips whitespace to find the next lexeme.
lookup: Assigns token codes to single-character tokens like +, -, or
parentheses.
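Putting the character classes and these utility subprograms together, the following is a hedged Python sketch of such a lexical analyzer for simple arithmetic expressions. The classic textbook version is written in C; the class name Lexer, the token names, and the test string here are illustrative assumptions.

LETTER, DIGIT, OTHER, EOF = "LETTER", "DIGIT", "OTHER", "EOF"

class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.char = ""
        self.char_class = None
        self.lexeme = ""

    def get_char(self):
        """getChar: read the next character and determine its class."""
        if self.pos < len(self.source):
            self.char = self.source[self.pos]
            self.pos += 1
            if self.char.isalpha():
                self.char_class = LETTER
            elif self.char.isdigit():
                self.char_class = DIGIT
            else:
                self.char_class = OTHER
        else:
            self.char, self.char_class = "", EOF

    def add_char(self):
        """addChar: append the current character to the lexeme being built."""
        self.lexeme += self.char

    def get_non_blank(self):
        """getNonBlank: skip whitespace until a non-blank character is found."""
        while self.char.isspace():
            self.get_char()

    def lookup(self, ch):
        """lookup: map a single-character token to a token code."""
        return {"+": "ADD_OP", "-": "SUB_OP", "*": "MUL_OP", "/": "DIV_OP",
                "(": "LEFT_PAREN", ")": "RIGHT_PAREN"}.get(ch, "UNKNOWN")

    def lex(self):
        """Return the next (token, lexeme) pair."""
        self.lexeme = ""
        self.get_non_blank()
        if self.char_class == LETTER:          # variable names: LETTER (LETTER|DIGIT)*
            while self.char_class in (LETTER, DIGIT):
                self.add_char()
                self.get_char()
            return ("IDENT", self.lexeme)
        if self.char_class == DIGIT:           # integer literals: DIGIT+
            while self.char_class == DIGIT:
                self.add_char()
                self.get_char()
            return ("INT_LIT", self.lexeme)
        if self.char_class == OTHER:           # single-character tokens
            token = self.lookup(self.char)
            self.add_char()
            self.get_char()
            return (token, self.lexeme)
        return ("EOF", "")

lexer = Lexer("(sum + 47) / total")
lexer.get_char()
token = lexer.lex()
while token[0] != "EOF":
    print(token)
    token = lexer.lex()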
The syntax of a programming language defines the rules for writing
code that the compiler or interpreter can understand. It acts as the
language's "grammar," ensuring code is structured and meaningful,
much like rules in natural languages create valid sentences.
For example:
Keywords: Reserved words like if, while, or function.
Operators: Symbols like +, -, or == used for operations.
Delimiters: Characters like {}, (), or ; used to define blocks or separate
instructions.
Expressions and Statements: Rules for forming valid expressions (a
+ b) and statements (if (a > b) { ... }).
Indentation or Punctuation: Some languages rely on indentation,
while others use semicolons to terminate statements.
The combination of these rules ensures the program is logically and
structurally correct. Each programming language has its unique
syntactic rules that differentiate it from others.
The syntax analyzer is the second phase of compilation. It takes the
token stream produced by the lexical analyzer and checks if it conforms
to the grammatical structure of the programming language.
Key Functions
Parsing: Validates the arrangement of tokens against the language's
grammar, usually defined in the form of production rules.
Error Detection: Flags syntax errors (e.g., missing semicolons,
mismatched parentheses, improper order of tokens).
Construct Parse Tree: Builds a hierarchical tree (parse tree or abstract
syntax tree) that represents the syntactic structure of the program.
There are several reasons why lexical analysis is separated from syntax analysis:
1. Simplicity: Techniques for lexical analysis are less complex than
those required for syntax analysis.
2. Efficiency: Although it pays to optimize the lexical analyzer,
because lexical analysis requires a significant portion of total
compilation time, it is not fruitful to optimize the syntax analyzer.
Separation facilitates this selective optimization.
3. Portability: Because the lexical analyzer reads input program files
and often includes buffering of that input, it is somewhat platform
dependent. However, the syntax analyzer can be platform
independent. It is always good to isolate machine-dependent parts of
any software system.
1.2 Derivations and Parse Tree
A parse tree (also known as a syntax tree) visually represents the
syntactic structure of source code according to the grammar of the
programming language. It is derived from the token stream by systematically
applying grammar rules.
A context-free grammar (CFG) shows how to generate syntactically valid strings of terminals.
Grammar Rules: The grammar defines how tokens combine to form
valid structures, such as expressions, statements, or declarations.
These rules are typically represented as production rules in a
context-free grammar (CFG).
Example Grammar Rules (Simplified):
For the statement int sum = a + 42;
<declaration> → <type> <identifier> = <expression> ;
<expression> → <identifier> + <value>
<type> → int | float | char
<value> → <integer_literal> | <string_literal>
Parsing and Building the Tree: The parser applies grammar rules to
construct the parse tree, ensuring the tokens form a valid structure. For
int sum = a + 42;, the parser builds the tree as follows:
<declaration>
├── <type> → int
├── <identifier> → sum
├── =
├── <expression>
│   ├── <identifier> → a
│   ├── +
│   └── <value> → 42
└── ;
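To show how a parser might apply these rules, here is a hedged recursive-descent sketch in Python, with one function per nonterminal. The token list matches the lexer output shown earlier; the class name Parser, the function names, and the dictionary-based tree node are illustrative assumptions, not the notes' own parser.

# Recursive-descent sketch for the simplified grammar above.
TOKENS = [("KEYWORD", "int"), ("IDENTIFIER", "sum"), ("OPERATOR", "="),
          ("IDENTIFIER", "a"), ("OPERATOR", "+"), ("NUMBER", "42"),
          ("PUNCTUATION", ";")]

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, kind, value=None):
        """Consume the next token if it matches, otherwise report a syntax error."""
        tok_kind, tok_value = self.tokens[self.pos]
        if tok_kind != kind or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, found {tok_value!r}")
        self.pos += 1
        return tok_value

    def parse_declaration(self):
        # <declaration> -> <type> <identifier> = <expression> ;
        node = {"type": self.parse_type(),
                "name": self.expect("IDENTIFIER")}
        self.expect("OPERATOR", "=")
        node["expr"] = self.parse_expression()
        self.expect("PUNCTUATION", ";")
        return ("declaration", node)

    def parse_type(self):
        # <type> -> int | float | char
        return self.expect("KEYWORD")

    def parse_expression(self):
        # <expression> -> <identifier> + <value>, with <value> -> <integer_literal>
        left = self.expect("IDENTIFIER")
        self.expect("OPERATOR", "+")
        right = self.expect("NUMBER")
        return ("+", left, right)

print(Parser(TOKENS).parse_declaration())
# ('declaration', {'type': 'int', 'name': 'sum', 'expr': ('+', 'a', '42')})

Each grammar rule corresponds to one parsing function, which is what makes recursive descent a direct encoding of the CFG.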
CFG for programming languages: Terminal symbols represent basic syntactic units
(lexemes). Nonterminal symbols use descriptive names like <while_statement> or
<expr> to represent higher-level constructs. Mixed strings of terminals and nonterminals
(conventionally denoted by Greek letters) are used in grammar rules, which serve as the
foundation for parsing algorithms.
Aspect          | Lexical Analyzer (Lexer)                            | Syntax Analyzer (Parser)
Role            | Converts source code into tokens.                   | Checks the token sequence for syntax.
Output          | Stream of tokens.                                   | Parse tree or syntax errors.
Error Detection | Catches lexical errors (e.g., invalid identifiers). | Catches syntax errors (e.g., missing operators).
Focus           | Low-level (characters and tokens).                  | High-level (language grammar).