Lecture Notes for Com Sci 221, Programming Languages
Last modified: Fri Jan 27 15:14:16 1995
Definition time again, using Webster:
Programming language syntax has been formalized very successfully into a cookbook discipline. So, we can deal with it briefly and efficiently, and get on to our informal concern with the much more difficult issues of semantics.
There are three famous methods for specifying programming language syntax: context-free grammars (CFGs), Backus-Naur Form (BNF), and syntax charts.
E --> T
E --> T + E
E --> T - E
T --> F
T --> F * T
T --> F div T
F --> ( E )
F --> V
F --> C

Figure 2: example of a Context-Free Grammar
There is no essential difference between a CFG and a BNF. In a BNF, the derivation symbol in each production is written ::=, instead of -->. The nonterminals are written as words or phrases inside pointy brackets (< ... >), instead of capital letters. The terminals are quoted. None of this should be very exciting. Syntax charts look quite different at first glance, but Homework 2 shows by example that they are also essentially the same as CFGs and BNFs. From now on, I will refer to CFGs when I discuss the qualities shared by all three notations.
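To make the correspondence concrete, here is one production of Figure 2 written both ways (my own rendering; the words inside the angle brackets are illustrative choices, not anything official):

```
CFG:  T --> F * T

BNF:  <term> ::= <factor> "*" <term>
```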
There are some serious notational conveniences, such as the iterative and conditional notations in EBNF, that do not affect the basic power of CFGs and similar systems. Make sure that you understand how each extended notation may be translated into pure CFG notation.
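For instance (my illustration, not one from the text), an EBNF iteration can always be eliminated by introducing a fresh nonterminal and recursion:

```
EBNF:      E ::= T { "+" T }

Pure CFG:  E --> T R
           R --> + T R
           R -->            (the empty string)
```

The fresh nonterminal R is exactly the sort of gimmick symbol discussed below: it carries no grammatical meaning of its own, it just simulates the iteration.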
Books and articles in theoretical computer science, and even in programming languages and linguistics, emphasize the use of CFGs to define which strings of symbols are and are not syntactically correct members of a given language. That is not really the important value of CFGs. The truly important use of CFGs is to define a way to parse syntactically correct strings: that is, to associate with each string a tree structure (called a derivation tree or parse tree, or sometimes just a parse) presenting the syntax of the string. The parse of a program is a much better starting point for interpreting or compiling the program than the plain source text.
CFGs are wonderful because they are simultaneously readable by humans, and suitable as the basis for completely automatic parsing. In effect, CFGs are a sort of highly self-documenting programming language for parsers. They are included in programming language manuals as the last resort documentation of syntactic issues. And, they are processed by parser generators, such as Yacc and Bison, which compile them into parsing code.
CFGs represent an incredible success story in computer science. In the olden days, when FORTRAN was just being invented, the problem of parsing a program was the subject of Ph.D. dissertations. Now, the automatic processing of CFG specifications allows college students in a compiler writing course to solve parsing problems routinely. The first automatic parser generators were so exciting that people called them "compiler compilers." Of course, a parser generator merely frees the implementor of a compiler to spend her time on the really hard part: generating good code.
The marvellously self-documenting quality of CFGs arises because, when they are constructed wisely, the nonterminal symbols of CFGs represent sensible syntactic categories. For example, in the CFG for arithmetic expressions in Figure 2 above, the nonterminal symbol E represents the category of expressions, V represents variables, and C represents constants. Similarly meaningful categories, such as statements, declarations, etc., also correspond to particular nonterminal symbols in a complete CFG for a whole programming language. But, there are also nonterminal symbols, such as T and F in Figure 2 above, that do not correspond to grammatically meaningful and useful categories. Sure, T is supposed to stand for "term," and F for "factor," but those are not particularly important grammatical categories in a program. Rather, T and F are gimmicks, added to the grammar to enforce the normal rules giving * and div precedence over + and -.
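To see how a CFG drives parsing, and how T and F do their precedence-enforcing work, here is a minimal recursive-descent parser for the Figure 2 grammar. This is my own sketch, not anything from the text or the course software; the tokenizer is deliberately naive, and the parse tree comes out as nested tuples.

```python
# A recursive-descent parser for the Figure 2 grammar (a sketch; all names
# are my own).  Each nonterminal becomes one function, and each function
# returns (parse tree node, index of the next unconsumed token).

def tokenize(s):
    """Split an expression like 'x + y * 3' into a list of tokens."""
    for op in "()+-*":
        s = s.replace(op, f" {op} ")
    return s.split()

def parse_E(toks, i):
    # E --> T | T + E | T - E
    node, i = parse_T(toks, i)
    if i < len(toks) and toks[i] in "+-":
        op = toks[i]
        right, i = parse_E(toks, i + 1)
        return ("E", node, op, right), i
    return ("E", node), i

def parse_T(toks, i):
    # T --> F | F * T | F div T
    node, i = parse_F(toks, i)
    if i < len(toks) and toks[i] in ("*", "div"):
        op = toks[i]
        right, i = parse_T(toks, i + 1)
        return ("T", node, op, right), i
    return ("T", node), i

def parse_F(toks, i):
    # F --> ( E ) | V | C
    if toks[i] == "(":
        node, i = parse_E(toks, i + 1)
        assert toks[i] == ")", "expected closing parenthesis"
        return ("F", "(", node, ")"), i + 1
    return ("F", toks[i]), i + 1   # a variable or a constant

tree, _ = parse_E(tokenize("x + y * 3"), 0)
# The * ends up nested in a T node below the +, so it binds tighter:
# precedence falls out of the grammar, with no extra machinery.
print(tree)
```

Notice that the parser never mentions precedence explicitly; the layering of E over T over F does all the work. (Because the grammar is right-recursive, this sketch makes + and - group to the right, a fine point that a production compiler would care about.)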
You must learn to distinguish, based on intuition and common sense,
the grammatically meaningful parts of the structure determined by a
CFG from the gimmicks. There are some extended notations for CFG that
reduce the dependence on gimmicks, but a lot of programming language
manuals still give the gimmicks equal status with the meaningful
symbols.
End
Monday 16 January
My lecture in class did not follow the order of the lecture notes very
well here, so this cut point is particularly fuzzy.
Begin
Wednesday 18 January
I refuse to be absolutely precise, but I'll come much closer
than with other definitions. For almost all purposes, almost all of
the syntactic structure of almost all programming languages may be
represented by a syntax tree. Most people in
computer science say "abstract syntax tree" instead of "syntax tree,"
because they never looked up the definition of syntax, and they think
that "syntax" by itself (or "concrete syntax" when they want to be
more pedantic) means typography, rather than structure.
So, syntax is a tree. But what tree? Well, given the right
CFG for a language, the syntax tree of a program is almost the parse tree,
except that the terminal symbols and gimmicks are taken out, and the
natural conceptual operators are put in. This is best understood by
example. Consider the expression
x + y * 3 + z * (w - 4)
The parse tree, using the grammar in Figure 2
(with some obvious additional productions to get rid of Vs and
Cs) is shown in Figure 3. The
most usual idea of the syntax tree is shown in Figure 4. Notice that the extra steps
involving T and F have been omitted, since they are
really just gimmicks to enforce precedence. At each node of the tree,
instead of the nonterminal from the parse tree, we have the operator
that is being applied. Reasonable people may disagree over fine points
in the construction of syntax trees. For example, if + and * are
understood as operations combining more than 2 operands (which is
suggested by the EBNF version on p. 19 of the text), then we might
prefer the syntax tree of Figure 5,
which treats the iteration of the production E --> T +
E as a gimmick, rather than a structural step.
I wrote out "add," "mult," etc. in this example to emphasize that the operation is not the same thing as the terminal symbol (+ or *) that corresponds to it so naturally. In the future, I will use the most convenient and mnemonic symbols in syntax trees, which will often be the same as the symbols in the "concrete syntax." In other examples, such as the if ... then ... else ... fi example in Figure 6, there is no clear 1-1 correspondence between "concrete" symbols and "abstract" operators. Notice that in popular mathematical notation, there is often no terminal symbol at all to denote multiplication. Also, parentheses do not correspond to anything in the syntax tree. Rather, they are a gimmick involving terminal symbols, used to control the shape of the syntax tree.
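One plausible encoding of the Figure 4 style of syntax tree, together with an evaluator that applies the operator sitting at each node, looks like this (my own sketch; the tuple representation and the environment are assumptions, not anything from the text):

```python
# The syntax tree of  x + y * 3 + z * (w - 4)  with the T/F gimmicks and
# the parentheses gone, and named operators ("add", "mult", "sub") at the
# nodes.  Each interior node is (operator, left subtree, right subtree).

tree = ("add", "x",
        ("add", ("mult", "y", 3),
                ("mult", "z", ("sub", "w", 4))))

def evaluate(node, env):
    """Walk the syntax tree, applying the operator at each interior node."""
    if isinstance(node, tuple):
        op, left, right = node
        a, b = evaluate(left, env), evaluate(right, env)
        return {"add": a + b, "sub": a - b, "mult": a * b}[op]
    return env.get(node, node)   # a variable looked up, or a constant

print(evaluate(tree, {"x": 1, "y": 2, "z": 3, "w": 10}))   # 1 + 2*3 + 3*(10-4) = 25
```

The evaluator is one example of why the syntax tree, not the source text, is the natural starting point for interpretation: it simply dispatches on the operator at each node and never has to think about precedence or parentheses again.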
Entire programs, as well as expressions, have natural syntax trees. In principle, there is nothing at all subtle about associating a syntax tree with a program, but many students confuse syntax trees with the similar looking, but very different, flow charts. A syntax tree shows the structure of the program as it is constructed from its parts. A flow chart shows the structure of the execution of the program. The best way to understand syntax trees for programs is to study carefully the example in Figure 7, which gives a syntax tree for the program in Figure 8 below.
read(i);
while i>1 do
  if odd(i) then i := 3i+1 else i := i/2 fi
od

Figure 8

Puzzle: does the program above halt for all inputs?
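For experimenting with the puzzle, here is the Figure 8 program transcribed into Python (a direct sketch of my own: read(i) becomes a parameter, and the successive values of i are collected so a run can be inspected):

```python
# The Figure 8 program, transcribed.  Note that i := i/2 is only reached
# when i is even, so integer division is exact there.

def figure8(i):
    trace = [i]
    while i > 1:
        if i % 2 == 1:          # odd(i)
            i = 3 * i + 1
        else:
            i = i // 2
        trace.append(i)
    return trace

print(figure8(6))   # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

Trying small inputs shows runs that wander surprisingly far upward before coming back down, which is what makes the puzzle interesting.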