# Part 2: Introduction to Parsing
In this part of our compiler writing journey, I'm going to introduce the
basics of a parser. As I mentioned in the first part, the job of the parser
is to recognise the syntax and structural elements of the input and ensure
that they conform to the *grammar* of the language.
We already have several language elements that we can scan in, i.e. our tokens:
+ the four basic maths operators: `*`, `/`, `+` and `-`
+ decimal whole numbers which have 1 or more digits `0` .. `9`
Now let's define a grammar for the language that our parser will recognise.
## BNF: Backus-Naur Form
You will come across the use of
[BNF ](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form)
at some point if you get into dealing
with computer languages. I will just introduce enough of the BNF syntax
here to express the grammar we want to recognise.
We want a grammar to express maths expressions with whole numbers. Here is
the BNF description of the grammar:
```
expression: number
| expression '*' expression
| expression '/' expression
| expression '+' expression
| expression '-' expression
;
number: T_INTLIT
;
```
The vertical bars separate options in the grammar, so the above says:
+ An expression could be just a number, or
+ An expression is two expressions separated by a '*' token, or
+ An expression is two expressions separated by a '/' token, or
+ An expression is two expressions separated by a '+' token, or
+ An expression is two expressions separated by a '-' token
+ A number is always a T_INTLIT token
It should be pretty obvious that the BNF definition of the grammar is
*recursive*: an expression is defined by referencing other expressions.
But there is a way to *bottom-out" the recursion: when an expression
turns out to be a number, this is always a T_INTLIT token and thus
not recursive.
In BNF, we say that "expression" and "number" are *non-terminal* symbols, as
they are produced by rules in the grammar. However, T_INTLIT is a *terminal*
symbol as it is not defined by any rule. Instead, it is an already-recognised
token in the language. Similarly, the four maths operator tokens are
terminal symbols.
## Recursive Descent Parsing
Given that the grammar for our language is recursive, it makes sense for
us to try and parse it recursively. What we need to do is to read in a
token, then *look ahead* to the next token. Based on what the next token
is, we can then decide what path we need to take to parse the input.
This may require us to recursively call a function that has already been
called.
In our case, the first token in any expression will be a number and this
may be followed by maths operator. After that there may only
be a single number, or there may be the start of a whole new expression.
How can we parse this recursively?
We can write pseudo-code that looks like this:
```
function expression() {
Scan and check the first token is a number. Error if it's not
Get the next token
If we have reached the end of the input, return, i.e. base case
Otherwise, call expression()
}
```
Let's run this function on the input `2 + 3 - 5 T_EOF` where `T_EOF`
is a token that reflects the end of the input. I will number each
call to `expression()`.
```
expression0:
Scan in the 2, it's a number
Get next token, +, which isn't T_EOF
Call expression()
expression1:
Scan in the 3, it's a number
Get next token, -, which isn't T_EOF
Call expression()
expression2:
Scan in the 5, it's a number
Get next token, T_EOF, so return from expression2
return from expression1
return from expression0
```
Yes, the function was able to recursively parse the input
`2 + 3 - 5 T_EOF`.
Of course, we haven't done anything with the
input, but that isn't the job of the parser. The parser's job is
to *recognise* the input, and warn of any syntax errors. Someone
else is going to do the *semantic analysis* of the input, i.e. to
understand and perform the meaning of this input.
> Later on, you will see that this isn't actually true. It often makes
sense to intertwine the syntax analysis and semantic analysis.
## Abstract Syntax Trees
To do the semantic analysis, we need code that either interprets
the recognised input, or translates it to another format, e.g.
assembly code. In this part of the journey, we will build an
interpreter for the input. But to get there, we are first going to
convert the input into an
[abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree),
also known as an AST.
I highly recommend you read this short explanation of ASTs:
+ [Leveling Up One’s Parsing Game With ASTs](https://medium.com/basecs/leveling-up-ones-parsing-game-with-asts-d7a6fc2400ff)
by Vaidehi Joshi
It's well written and really help to explain the purpose and structure of ASTs.
Don't worry, I'll be here when you get back.
The structure of each node in the AST that we will build is described in
`defs.h`:
```c
// AST node types
enum {
A_ADD, A_SUBTRACT, A_MULTIPLY, A_DIVIDE, A_INTLIT
};
// Abstract Syntax Tree structure
struct ASTnode {
int op; // "Operation" to be performed on this tree
struct ASTnode *left; // Left and right child trees
struct ASTnode *right;
int intvalue; // For A_INTLIT, the integer value
};
```
Some AST nodes, like those with `op` values `A_ADD` and `A_SUBTRACT`
have two child ASTs that are pointed to by `left` and `right`. Later on,
we will add or subtract the values of the sub-trees.
Alternatively, an AST node with the `op` value A_INTLIT represents
an integer value. It has no sub-tree children, just a value in the
`intvalue` field.
## Building AST Nodes and Trees
The code in `tree.c` has the functions to build ASTs. The most
general function, `mkastnode()`, takes values for all four
fields in an AST node. It allocates the node, populates the field
values and returns a pointer to the node:
```c
// Build and return a generic AST node
struct ASTnode *mkastnode(int op, struct ASTnode *left,
struct ASTnode *right, int intvalue) {
struct ASTnode *n;
// Malloc a new ASTnode
n = (struct ASTnode *) malloc(sizeof(struct ASTnode));
if (n == NULL) {
fprintf(stderr, "Unable to malloc in mkastnode()\n");
exit(1);
}
// Copy in the field values and return it
n->op = op;
n->left = left;
n->right = right;
n->intvalue = intvalue;
return (n);
}
```
Given this, we can write more specific functions that make a leaf AST
node (i.e. one with no children), and make an AST node with a single child:
```c
// Make an AST leaf node
struct ASTnode *mkastleaf(int op, int intvalue) {
return (mkastnode(op, NULL, NULL, intvalue));
}
// Make a unary AST node: only one child
struct ASTnode *mkastunary(int op, struct ASTnode *left, int intvalue) {
return (mkastnode(op, left, NULL, intvalue));
```
## Purpose of the AST
We are going to use an AST to store each expression that we recognise
so that, later on, we can traverse it recursively to calculate the
final value of the expression. We do want to deal with the precedence of the
maths operators. Here is an example.
Consider the expression `2 * 3 + 4 * 5`. Now, multiplication has higher
precedence that addition. Therefore, we want to *bind* the multiplication
operands together and perform these operations before we do the addition.
If we generated the AST tree to look like this:
```
+
/ \
/ \
/ \
* *
/ \ / \
2 3 4 5
```
then, when traversing the tree, we would perform `2*3` first, then `4*5`.
Once we have these results, we can then pass them up to the root of the
tree to perform the addition.
## A Naive Expression Parser
Now, we could re-use the token values from our scanner as the AST node
operation values, but I like to keep the concept
评论0