c+ part 2: Parsing Left and Right Associative Binary Operators

Mon, 17 Dec 2007

So this is something mainly for my own reference, so don't expect much interesting stuff unless you are writing a parser. But here is a tiny bit of background: When you want a computer to understand a language, you first get it to understand the words (called 'tokens') so it knows where each token starts and ends (see part 1 for slightly more on this). Once you do this, you get a list of words and you need to give them some structure so you know which adjectives relate to which nouns and that sort of thing. This is called parsing.

The obvious interesting thing in parsing is parsing mathematical equations. When you parse "1 + 2 * 3", you need to remember the order of operations, so you know that the "+" refers to the "1" and the "2 * 3", not to the "1" and the "2".

Typically, you write a description of the language in a language like in part 1:

EXPRESSION = EXPRESSION '+' MULTIPLY_EXPRESSION
           | MULTIPLY_EXPRESSION

MULTIPLY_EXPRESSION = MULTIPLY_EXPRESSION '*' PRIMARY_EXPRESSION
                    | PRIMARY_EXPRESSION

PRIMARY_EXPRESSION = number
                   | '(' EXPRESSION ')'

And then rewrite it so that it is easier for a computer to know which rule to apply:

EXPRESSION = MULTIPLY_EXPRESSION EXPRESSION_2
EXPRESSION_2 = nothing
             | '+' EXPRESSION_2

MULTIPLY_EXPRESSION = PRIMARY_EXPRESSION MULTIPLY_EXPRESSION_2
MULTIPLY_EXPRESSION_2 = nothing
                      | '*' MULTIPLY_EXPRESSION_2

PRIMARY_EXPRESSION = number
                   | '(' EXPRESSION ')'

With this form, we don't get stuck in loops trying to parse stuff. Unfortunately, if we were to parse "1 + 2 + 3", we would parse this as "1 + (2 + 3)", which doesn't seem like much of a problem with a "+" sign in there, but there is a big difference between: "(1 - 1) - 1" and "1 - (1 - 1)". So what has happened is that we have changed the associativity of the operators to being right to left.

I've always hated fixing this problem, but I found a very neat algorithm for parsing it in a top-down recursive descent parser called precedence climbing. It is a pretty simple idea in a way - instead of writing a set of functions per precedence level, write a general function and have a table of operators with their precedences and associativity, then the general function can just solve the big picture problems and you don't have to worry.

The relevant part of my parser now looks like this:

static ParseNode parse_general_expression(Tokeniser t, int precedence) {
    int num_operators = (sizeof binary_operators) / (sizeof binary_operators[0]);

    ParseNode t0 = parse_primary_expression(t);

    while (1) {
        Token op = tokeniser_current(t);

        int operator_index = -1;
        for (int i = 0; i < num_operators; i++) {
            if (binary_operators[i].type == token_get_type(op)) {
                operator_index = i;
                break;
            }
        }

        if (operator_index < 0) {
            break;
        }
        if (binary_operators[operator_index].precedence < precedence) {
            break;
        }

        tokeniser_consume(t);
        int q;
        if (binary_operators[operator_index].left_associative) {
            q = 1 + binary_operators[operator_index].precedence;
        }
        else {
            q = binary_operators[operator_index].precedence;
        }

        ParseNode t1 = parse_general_expression(t, q);

        t0 = create_parse_node_binary(PNT_BINARY_OPERATOR, op, t0, t1);
    }

    return t0;
}

But I do need to add some more magic to it, since I'm experimenting with some more whacky precedence rules: I've always thought that "1 * 2+3" should be parsed as "1 * (2+3)" since that's what it looks like, so I'm going to try to parse things that way and see if it stays sane. Typically I don't like whitespace dependence (since lots of things (like HTML) will collapse multiple consecutive spaces into 1), but it just seems so much more natural to me.

very good,just wondering...!is it possible to get this parser.For learning purpose only. I am trying to create mine based on precedence climibing.
71 days after story
Name & email are optional. Email will not be obfuscated.
HTML tags will be removed except hyperlinks.
 

About

I'm a nerd living in Sydney. This is a place where I can write stuff about my interests and not care that no one else is reading.

I like music, maths, programming, pretty pictures, filters and other good things.

(more info)

It should be fairly obvious that this isn't connected to my employer at all.

Email me (not a catchpa)

Email policy

Subscribe

RSS Feed RSS

Get an aggregator

Liferea (Linux)

Vienna (OSX)

Feedreader (Windows)

Google Reader (Web based)

I've only used Liferea, so I can't vouch for the other ones.

About this site

This site runs a (modified) version of blosxom.

The host is GeekISP, and they seem to do an excellent job.