This study material has been compiled from lecture notes (copy-pasted text) and an audio transcript for the CMPE318 Concepts of Programming Languages course, Spring 2025-2026.
📚 Chapter 3: Describing Syntax and Semantics in Programming Languages
1. Introduction to Syntax and Semantics
Understanding how programming languages are structured and interpreted is fundamental to computer science. This involves two core concepts: syntax and semantics.
- Syntax 📚: Refers to the form or structure of expressions, statements, and program units. It dictates the rules for how code must be written to be considered grammatically correct.
- Semantics 📚: Defines the meaning of these expressions, statements, and program units. It explains what a syntactically correct program actually does.
Together, syntax and semantics provide a complete definition of a programming language. This definition is crucial for various stakeholders:
- Other Language Designers: To understand existing language features.
- Implementers: To build compilers and interpreters.
- Programmers: To correctly use the language.
Example (Java while statement):
- Syntax: `while (boolean_expr) statement`
- Semantics: If `boolean_expr` evaluates to `true`, `statement` is executed and control returns to re-evaluate `boolean_expr`. If `boolean_expr` is `false`, control proceeds past the `while` construct.
While often discussed separately, syntax and semantics are closely related. In a well-designed language, the syntax should strongly suggest the statement's intended meaning. Describing syntax is generally easier than semantics, as concise and universally accepted notations exist for syntax, but not yet for semantics.
2. The General Problem of Describing Syntax: Terminology
To formally describe the syntax of a language, specific terminology is used:
- Sentence 📚: A string of characters over a defined alphabet.
- Language 📚: A set of sentences over an alphabet.
- Lexeme 📚: The lowest-level syntactic unit of a language (e.g., `*`, `sum`, `if`).
- Token 📚: A category of lexemes (e.g., `identifier`, `assignment_operator`).
Example (Java): `index = 2 * count + 17;`
| Lexeme | Token |
| :-------- | :---------------- |
| index | identifier |
| = | assignment_operator |
| 2 | integer_literal |
| * | mult_operator |
| count | identifier |
| + | addition_operator |
| 17 | integer_literal |
| ; | semicolon |
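The lexeme-to-token mapping above can be sketched as a tiny scanner. This is a hypothetical illustration (the token names follow the table; the regular expressions and the `tokenize` helper are my own), not a production lexer:

```python
import re

# Ordered (token_name, pattern) pairs; names match the table above.
TOKEN_SPEC = [
    ("identifier",          r"[A-Za-z_][A-Za-z0-9_]*"),
    ("integer_literal",     r"\d+"),
    ("assignment_operator", r"="),
    ("mult_operator",       r"\*"),
    ("addition_operator",   r"\+"),
    ("semicolon",           r";"),
]
# One master pattern with a named group per token category.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (lexeme, token) pairs; characters matching no pattern (spaces) are skipped."""
    return [(m.group(), m.lastgroup) for m in MASTER.finditer(source)]
```

Running `tokenize("index = 2 * count + 17;")` reproduces exactly the rows of the table.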
3. Formal Definition of Languages
Formal definitions of languages can be approached in two ways:
3.1. Recognizers
- A recognizer is a mechanism (e.g., a program) that determines whether a given input string belongs to a specific language.
- It reads strings and indicates if they are "in" the language (accepts) or "not in" the language (rejects).
- ✅ The syntax analysis part of a compiler, often called a parser, acts as a recognizer, determining if a program is syntactically correct.
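A recognizer can be sketched as a small finite automaton. This hypothetical example accepts exactly the strings of the identifier language `letter (letter | digit)*` (the function name and state labels are my own):

```python
def is_identifier(s):
    """Finite-state recognizer for the language letter (letter | digit)*.
    Returns True (accept) iff s is in the language, False (reject) otherwise."""
    state = "start"
    for ch in s:
        if state == "start":
            if ch.isalpha():
                state = "in_id"
            else:
                return False      # reject: must begin with a letter
        elif state == "in_id":
            if not ch.isalnum():
                return False      # reject: illegal character
    return state == "in_id"       # accept only if at least one letter was read
```

For example, `is_identifier("sum")` accepts, while `is_identifier("2x")` and `is_identifier("")` reject.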
3.2. Generators
- A generator is a device that produces all valid sentences of a language.
- One can verify the syntax of a statement by comparing it with the structure generated by the device.
- 💡 The close connection between formal generation and recognition devices for the same language was a pivotal discovery in computer science, foundational to formal languages and compiler design theory.
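A generator can be sketched as a breadth-first expansion of sentential forms. This hypothetical example (grammar encoding and function are my own) enumerates every sentence of the list grammar `<list> → id | id , <list>` up to a length bound:

```python
from collections import deque

# Grammar: <list> → id | id , <list>   (the nonterminal is written "LIST")
RULES = {"LIST": [["id"], ["id", ",", "LIST"]]}

def generate(start, max_len):
    """Expand sentential forms breadth-first; yield every sentence
    (all-terminal string) of at most max_len symbols."""
    queue = deque([[start]])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue
        for i, sym in enumerate(form):
            if sym in RULES:                     # leftmost nonterminal found
                for rhs in RULES[sym]:
                    queue.append(form[:i] + rhs + form[i + 1:])
                break
        else:
            yield " ".join(form)                 # no nonterminals: a sentence

print(list(generate("LIST", 5)))  # ['id', 'id , id', 'id , id , id']
```

Checking whether a statement is "in" the language then amounts to comparing it against the generated sentences, exactly as the bullet above describes.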
4. Formal Methods for Describing Syntax: BNF and Context-Free Grammars
4.1. Context-Free Grammars (CFGs)
- Developed by Noam Chomsky in the mid-1950s to describe the syntax of natural languages.
- They define a class of languages called context-free languages.
4.2. Backus-Naur Form (BNF)
- Invented by John Backus in 1959 to describe the syntax of Algol 58.
- ✅ BNF is formally equivalent to context-free grammars.
4.3. BNF Fundamentals
- Abstractions (Nonterminal Symbols) 📚: Represent classes of syntactic structures and are enclosed in angle brackets (e.g., `<if_stmt>`).
- Terminal Symbols 📚: Lexemes or tokens that appear literally in the language (e.g., `if`, `(`, `identifier`).
- Rule (Production) 📚: Has a Left-Hand Side (LHS), which is a single nonterminal, and a Right-Hand Side (RHS), a string of terminals and/or nonterminals.
  - Example: `<if_stmt> → if ( <logic_expr> ) <stmt>`
- Grammar 📚: A finite, non-empty set of rules.
- Start Symbol 📚: A special nonterminal from which all derivations begin.
- Alternative RHSs: A nonterminal can have multiple RHSs, separated by `|`.
  - Example: `<stmt> → <single_stmt> | begin <stmt_list> end`
- Syntactic Lists: Described using recursion.
  - Example: `<ident_list> → identifier | identifier, <ident_list>`
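The recursive list rule maps directly onto a recursive procedure. A hypothetical sketch (function name and token representation are my own), where the input is a list of token names:

```python
def parse_ident_list(tokens, i=0):
    """Check <ident_list> → identifier | identifier , <ident_list>.
    Returns the index just past the list, or raises SyntaxError."""
    if i >= len(tokens) or tokens[i] != "identifier":
        raise SyntaxError("identifier expected")
    i += 1
    if i < len(tokens) and tokens[i] == ",":
        return parse_ident_list(tokens, i + 1)   # the recursion realizes the list
    return i
```

For example, `parse_ident_list(["identifier", ",", "identifier"])` returns 3, having consumed the whole list.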
4.4. Derivations and Parse Trees
- Derivation 📚: A repeated application of grammar rules, starting with the start symbol and ending with a sentence (a string of only terminal symbols).
- Sentential Form 📚: Every string of symbols in a derivation.
- Sentence 📚: A sentential form consisting only of terminal symbols.
- Leftmost Derivation: One where the leftmost nonterminal in each sentential form is expanded first.
- Parse Tree 📚: A hierarchical representation of a derivation, visually showing the syntactic structure of a sentence.
Example Derivation:
`<program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const`
Example Parse Tree for `A = B * (A + C)` (using the usual textbook rules `<assign> → <id> = <expr>`, `<id> → A | B | C`, `<expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>`):

                <assign>
               /   |    \
           <id>    =    <expr>
            |          /  |   \
            A      <id>   *   <expr>
                    |        /  |  \
                    B       ( <expr> )
                            /   |   \
                        <id>    +   <expr>
                         |            |
                         A          <id>
                                      |
                                      C
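A parse tree like this can be built mechanically by a recursive-descent routine. A hypothetical sketch (the nested-tuple tree representation and function names are my own; the grammar is the usual textbook fragment `<assign> → <id> = <expr>`, `<expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>`):

```python
def parse_expr(toks, i):
    """Parse <expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>."""
    if toks[i] == "(":
        inner, i = parse_expr(toks, i + 1)
        if toks[i] != ")":
            raise SyntaxError(") expected")
        return ("paren", inner), i + 1
    ident = ("id", toks[i])                # a single-letter <id>
    i += 1
    if i < len(toks) and toks[i] in ("+", "*"):
        op = toks[i]
        right, i = parse_expr(toks, i + 1)
        return (op, ident, right), i
    return ident, i

def parse_assign(toks):
    """Parse <assign> → <id> = <expr>; return a nested-tuple parse tree."""
    if toks[1] != "=":
        raise SyntaxError("= expected")
    expr, i = parse_expr(toks, 2)
    if i != len(toks):
        raise SyntaxError("trailing tokens")
    return ("=", ("id", toks[0]), expr)

print(parse_assign(list("A=B*(A+C)")))
# ('=', ('id', 'A'), ('*', ('id', 'B'), ('paren', ('+', ('id', 'A'), ('id', 'C')))))
```

The nesting of the returned tuples mirrors the hierarchy of the tree above.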
4.5. Ambiguity in Grammars
- Ambiguous Grammar ⚠️: A grammar is ambiguous if it generates a sentence that has two or more distinct parse trees.
- Problem: Compilers often base the semantics (meaning) of structures on their syntactic form (parse tree). If a structure has multiple parse trees, its meaning cannot be uniquely determined, leading to incorrect code generation.
- Example: `A = B + C * A` can have two parse trees, one where `+` has precedence and one where `*` has precedence, leading to different interpretations.
Addressing Ambiguity:
- Operator Precedence: Can be enforced by introducing different "layers" of expressions and subexpressions in the grammar.
  - Example: `<expr> → <expr> + <term> | <term>` and `<term> → <term> * <factor> | <factor>` ensure `*` has higher precedence than `+`.
- Operator Associativity: Specifies how operators of the same precedence group when they appear consecutively (e.g., `A - B - C`). Left associativity (`(A - B) - C`) is common for arithmetic operators, and a left-recursive rule yields it.
  - Example: `<expr> → <expr> + const | const` (left-associative) vs. `<expr> → const + <expr> | const` (right-associative).
- Good News ✅: Common forms of ambiguity (precedence & associativity) can often be handled by carefully structuring the grammar or by using parser generators like YACC and Bison.
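The layered grammar above can be turned directly into an unambiguous evaluator. A hypothetical sketch (function names are my own) in which each grammar layer becomes one function, and the left recursion becomes a loop, which also yields left associativity:

```python
# Follows the unambiguous layered grammar:
#   <expr>   → <expr> + <term> | <term>
#   <term>   → <term> * <factor> | <factor>
#   <factor> → const

def evaluate(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def factor():
        nonlocal pos
        value = int(tokens[pos]); pos += 1     # <factor> → const
        return value
    def term():
        nonlocal pos
        value = factor()
        while peek() == "*":                   # * is consumed here, below +
            pos += 1
            value *= factor()
        return value
    def expr():
        nonlocal pos
        value = term()
        while peek() == "+":                   # left-associative accumulation
            pos += 1
            value += term()
        return value
    return expr()

print(evaluate(["2", "+", "3", "*", "4"]))  # 14, because * has precedence
```

Because `term` calls `factor` before `expr` ever sees a `+`, the single parse (and hence the single meaning) gives `*` the higher precedence.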
5. Extended BNF (EBNF)
- Purpose: EBNF is a few simple extensions to BNF that make expressing grammars more convenient and concise.
- Equivalence: EBNF is no more "powerful" than BNF; anything expressible in EBNF can also be expressed in BNF.
- Usage: Widely used as the de facto standard for defining programming languages.
EBNF Extensions:
- Optional parts: Placed in square brackets `[ ]`.
  - Example: `<func_call> → ident ( [<expr_list>] )` (meaning `<expr_list>` is optional)
- Alternative parts: Placed inside parentheses and separated by vertical bars `|`.
  - Example: `<term> → <term> (+ | -) const` (meaning `+` or `-`)
- Repetitions (0 or more): Placed inside curly braces `{ }`.
  - Example: `<ident> → letter {letter | digit}` (meaning a letter followed by zero or more letters or digits)
BNF vs. EBNF Example:
| BNF | EBNF |
| :-- | :-- |
| `<expr> → <expr> + <term> \| <expr> - <term> \| <term>` | `<expr> → <term> {(+ \| -) <term>}` |
| `<term> → <term> * <factor> \| <term> / <factor> \| <factor>` | `<term> → <factor> {(* \| /) <factor>}` |
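EBNF's extensions map directly onto control flow when writing a parser by hand: `[X]` becomes an `if`, `{X}` becomes a `while`. A hypothetical sketch for `<func_call> → ident ( [<expr_list>] )` with `<expr_list> → <expr> {, <expr>}`, where each `<expr>` is simply an identifier token (the function name and token representation are my own):

```python
def parse_func_call(tokens):
    """Parse ident ( [<expr_list>] ) and return the list of argument names."""
    i = 0
    if not tokens[i].isidentifier():
        raise SyntaxError("function name expected")
    i += 1
    if tokens[i] != "(":
        raise SyntaxError("( expected")
    i += 1
    args = []
    if tokens[i] != ")":              # [<expr_list>] : the optional part
        args.append(tokens[i]); i += 1
        while tokens[i] == ",":       # {, <expr>} : zero or more repetitions
            i += 1
            args.append(tokens[i]); i += 1
    if tokens[i] != ")":
        raise SyntaxError(") expected")
    return args

print(parse_func_call(["f", "(", "x", ",", "y", ")"]))  # ['x', 'y']
print(parse_func_call(["f", "(", ")"]))                 # []
```

The empty-argument call exercises exactly the optionality that the square brackets express.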
6. Static Semantics
6.1. Limitations of Context-Free Grammars
- CFGs (and thus BNF/EBNF) cannot describe all syntactic rules of programming languages.
- Troublesome categories:
- Context-free but cumbersome (e.g., requiring operands in an expression to be of compatible types).
- Non-context-free (e.g., a variable must be declared before it is used).
6.2. Definition of Static Semantics
- Static Semantics 📚: Refers to the legal forms of programs that go beyond context-free syntax. These rules are not directly related to the program's meaning during execution but rather to its structural correctness.
- Static Analysis: These rules are called "static" because the analysis required to check them can be performed at compile time, before program execution.
- Many static semantic rules relate to type constraints (e.g., type checking).
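Both troublesome categories can be checked without running the program. A hypothetical sketch (the tuple program representation and `check` function are my own) that enforces declare-before-use and type compatibility at "compile time", i.e., by inspecting the program text only:

```python
def check(program):
    """Static-semantics pass over a tiny program given as tuples:
    ("declare", name, type) and ("assign", target, source)."""
    symbols = {}                                   # name -> declared type
    for stmt in program:
        if stmt[0] == "declare":
            symbols[stmt[1]] = stmt[2]
        elif stmt[0] == "assign":
            for name in (stmt[1], stmt[2]):
                if name not in symbols:            # non-context-free rule
                    raise NameError(f"{name} used before declaration")
            if symbols[stmt[1]] != symbols[stmt[2]]:
                raise TypeError("incompatible types in assignment")

# Accepted: both names declared with compatible types.
check([("declare", "a", "int"), ("declare", "b", "int"), ("assign", "a", "b")])
```

Changing `b`'s type to `"real"` would raise a `TypeError`, and dropping a declaration would raise a `NameError`, with no execution involved.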
6.3. Attribute Grammars
- Purpose: A formal mechanism devised to describe and check static semantics.
- Definition 📚: An attribute grammar extends a context-free grammar `G = (S, N, T, P)` with:
  - Attribute Values: For each grammar symbol `x`, a set `A(x)` of attribute values. These attributes carry semantic information.
  - Evaluation Rules: For each production rule, a set of functions that define how attribute values are computed for the nonterminals in that rule.
  - Predicates: A set of Boolean conditions to check for attribute consistency (e.g., type compatibility).
- Primary Uses:
  - Specification of static semantics.
  - Aid in compiler design for static-semantics checking.
- 💡 Although not always formally used, the basic concepts of attribute grammars are informally applied in every compiler.
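The three ingredients can be sketched on the textbook rule `<expr> → <var> + <var>` with a synthesized `actual_type` attribute. This is a hypothetical condensation (the function and its parameters are my own; the attribute names follow the usual presentation):

```python
def expr_attributes(var1_type, var2_type, expected_type):
    """Attribute evaluation for <expr> → <var> + <var>.
    var1_type / var2_type are the vars' actual_type attributes;
    expected_type is the context's expected_type attribute."""
    # Evaluation rule: <expr>.actual_type is int only if both operands are int.
    actual_type = "int" if (var1_type == var2_type == "int") else "real"
    # Predicate: the computed attribute must be consistent with the context.
    if actual_type != expected_type:
        raise TypeError(f"type clash: {actual_type} vs {expected_type}")
    return actual_type
```

For example, `expr_attributes("int", "int", "int")` evaluates cleanly, while `expr_attributes("int", "int", "real")` fails its predicate, which is exactly the kind of check a compiler's static-semantics phase performs.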