c+ part 1

Sun, 11 Nov 2007

I'm in the process of writing a tokeniser for a c-like programming language. A tokeniser takes a file and splits it up into meaningful chunks. For example, the program I'm using to write this has a tokeniser which splits up what I'm writing into chunks so it can spell check the individual words. So it needs to know that a word ends either on a space or a full stop or whatever. Unfortunately, it doesn't know that a word can continue on a ' so when I write a word like "doesn't", then it tells me that I mis-spelt "doesn" and suggests to replace it with "dozen" (leading to "dozen't", which at least sounds right when read).
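To make the apostrophe problem concrete, here is a minimal sketch of a word tokeniser, written in Python purely for brevity. It is not how the spell checker above works, nor how the c+ tokeniser is written. The name `tokenise_words` and the rule "an apostrophe only counts if it sits between two letters" are assumptions made for illustration.

```python
def tokenise_words(text):
    """Split text into words, keeping an apostrophe that sits between two letters."""
    words = []
    current = []  # letters of the word being built
    for i, ch in enumerate(text):
        if ch.isalpha():
            current.append(ch)
        elif (ch == "'" and current
              and i + 1 < len(text) and text[i + 1].isalpha()):
            # Apostrophe inside a word, as in "doesn't": the word continues.
            current.append(ch)
        elif current:
            # A space, a full stop or whatever: the current word ends here.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With that one extra rule, `tokenise_words("It doesn't work.")` keeps "doesn't" as a single word, while a quote mark at the start or end of a word is still treated as a separator.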

There are lots of programs available that will generate the tokeniser for you, if you give them all the rules they need to know, expressed in some simple language. I'm not a huge fan of these, partially because the code generated is ugly, partially because the code is only maintainable with the use of those programs, but mainly because it is more fun to do it yourself.

The next step after the tokeniser is the parser. The grammar checker in a word processor will have some form of parser, although since English is so horrible, it won't be nearly as pretty as the one for a designed language like a programming language. Once again you can either get a program to generate it for you, or you can do it yourself. When you get a program to generate the parser for you, you have to describe the language in an unambiguous way, and there are a few standard notations for doing this. Here is one:

SYNTAX     = { PRODUCTION } .
PRODUCTION = IDENTIFIER "=" EXPRESSION "." .
EXPRESSION = TERM { "|" TERM } .
TERM       = FACTOR { FACTOR } .
FACTOR     = IDENTIFIER
            | LITERAL
            | "[" EXPRESSION "]"
            | "(" EXPRESSION ")"
            | "{" EXPRESSION "}" .
IDENTIFIER = letter { letter } .
LITERAL    = """" character { character } """" .

I saw this on Wikipedia and found it amusing enough to include here. Obviously, it isn't the sort of thing you say at a party or in a stand-up routine, and I think if you even smile when you read it then something is wrong. But what it is, is the Wirth syntax notation being used to define itself. So it is about as funny as looking up "dictionary" in the dictionary, knowing that if you didn't know what a dictionary was then you wouldn't be looking in a dictionary to find out. Hilarious.
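As a rough idea of what the "do it yourself" route looks like, here is a sketch of a hand-written recursive descent recogniser for the notation above, with one method per production. It is in Python only to keep it short, and it only checks that the input is well formed; it builds no tree. The names (`tokenise`, `Parser`, `count_productions`) are made up for this example, and it assumes that "letter" means an ASCII letter and that a doubled quote inside a literal stands for one quote character.

```python
import re

# The grammar from the post, used here as the recogniser's own input.
WSN_GRAMMAR = r'''
SYNTAX     = { PRODUCTION } .
PRODUCTION = IDENTIFIER "=" EXPRESSION "." .
EXPRESSION = TERM { "|" TERM } .
TERM       = FACTOR { FACTOR } .
FACTOR     = IDENTIFIER
            | LITERAL
            | "[" EXPRESSION "]"
            | "(" EXPRESSION ")"
            | "{" EXPRESSION "}" .
IDENTIFIER = letter { letter } .
LITERAL    = """" character { character } """" .
'''

# One token per match: an identifier, a quoted literal, or one punctuation mark.
TOKEN_RE = re.compile(r'\s*(?:([A-Za-z]+)|("(?:[^"]|"")+")|([=.|\[\](){}]))')
CLOSERS = {"[": "]", "(": ")", "{": "}"}
STARTS_FACTOR = ("IDENTIFIER", "LITERAL", "[", "(", "{")

def tokenise(src):
    """Turn the source text into a list of (kind, text) pairs."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character near offset {pos}")
        ident, literal, punct = m.groups()
        if ident:
            tokens.append(("IDENTIFIER", ident))
        elif literal:
            tokens.append(("LITERAL", literal))
        else:
            tokens.append((punct, punct))  # punctuation is its own kind
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Kind of the next token, or None at end of input."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind!r}, found {self.peek()!r}")
        self.pos += 1

    def syntax(self):       # SYNTAX = { PRODUCTION } .
        count = 0
        while self.peek() is not None:
            self.production()
            count += 1
        return count

    def production(self):   # PRODUCTION = IDENTIFIER "=" EXPRESSION "." .
        self.expect("IDENTIFIER")
        self.expect("=")
        self.expression()
        self.expect(".")

    def expression(self):   # EXPRESSION = TERM { "|" TERM } .
        self.term()
        while self.peek() == "|":
            self.pos += 1
            self.term()

    def term(self):         # TERM = FACTOR { FACTOR } .
        self.factor()
        while self.peek() in STARTS_FACTOR:
            self.factor()

    def factor(self):       # FACTOR = IDENTIFIER | LITERAL | bracketed EXPRESSION .
        kind = self.peek()
        if kind in ("IDENTIFIER", "LITERAL"):
            self.pos += 1
        elif kind in CLOSERS:
            self.pos += 1
            self.expression()
            self.expect(CLOSERS[kind])
        else:
            raise SyntaxError(f"expected a factor, found {kind!r}")

def count_productions(src):
    """Check that src is valid and return how many productions it contains."""
    return Parser(tokenise(src)).syntax()
```

Feeding `WSN_GRAMMAR` to `count_productions` is the same joke again: the recogniser for the notation accepts the notation's own definition, and raises `SyntaxError` on something like `A = B` with the final full stop missing.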

So, what is "c+"? c+ is the pet name of this programming language I'm working on. It also has a lame computer science-ish joke behind it. Say you have a function that tells you the hypotenuse of a right-angled triangle given the two shorter sides:

hypot(3, 4) => 5

Then that's a pretty cool function. If you wanted a function that gave you the hypotenuse of a right-angled triangle that has one side of length 1 and the other of an unknown length, you could make a new function called hypot1:

function hypot1(other_side) {
    return hypot(1, other_side)
}

Or if you had partial function application, you could just say that the new function is called hypot(1) - that is, you give "hypot" only one of its arguments. So "c+" is just a function which returns the sum of c and its single argument. I don't know if I'll be so awesome as to allow it to work with operators like that... and the cost of muddying up function overloading might mean that partial function application will either be scrapped or given a different syntax.
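For comparison, this is what the same idea looks like in a language that already has partial application as a library feature. The sketch below uses Python's `functools.partial`; the helper name `c_plus` is invented here, and none of this is c+ syntax, which (as said above) may well end up different or absent.

```python
import math
from functools import partial
from operator import add

# hypot with its first argument fixed to 1: the "hypot(1)" of the text.
hypot1 = partial(math.hypot, 1)

def c_plus(c):
    """Return the function "c+": addition with its first argument fixed to c."""
    return partial(add, c)
```

So `hypot1(x)` behaves like `math.hypot(1, x)`, and `c_plus(2)` is a one-argument function that adds 2 to whatever it is given.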

The main goal of c+ is to have a fast language that is suitable for real-time stuff, but also safe. The safety will be done by not actually having any IO in the language, so a program in c+ will be a "pure" function, and won't be able to change the state of anything. At this point, you've either stopped reading because there are no pictures or links to follow, or you might be wondering how such a language is useful at all. The plan is that IO is done through explicit interfaces passed into the main function.

So, where a c program might look like this:

#include <stdio.h>
int main(int argc, char *argv[]) {
    printf("I'm printing to the screen\n");
    return 0;
}

And can be compiled and run standalone, a c+ program might look like this:

#include <printer>
public void main(Printer p) {
    p.print("I'm printing to the screen\n");
}

And rather than being able to be run straight off, it needs a driver program that supplies some sort of "Printer" object. A more complicated program might want a "NetworkAccess" object, or a "FileReader" object or whatever, but basically, where the c program says "I'm going to print stuff to the screen whether you like it or not", the "driver" of the c+ program gets to decide whether or not it will be able to print.
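The driver idea can be mocked up in an existing language. Below is a small Python sketch in which the program's `main` only talks to the `Printer` it is handed, and the driver chooses where that output really goes. The `Printer` class and `run_captured` are hypothetical names for this example. Python cannot actually stop `main` from calling the built-in `print`, so this shows the shape of the design, not the safety guarantee c+ is aiming for.

```python
import io

class Printer:
    """Hypothetical capability: the only route the program has to produce output."""
    def __init__(self, stream):
        self._stream = stream

    def print(self, text):
        self._stream.write(text)

def main(p):
    # The program itself: it can only use what the driver passed in.
    p.print("I'm printing to the screen\n")

def run_captured(program):
    """Driver that hands the program a Printer writing to memory, not the screen."""
    buffer = io.StringIO()
    program(Printer(buffer))
    return buffer.getvalue()
```

A different driver could pass `Printer(sys.stdout)` to really print, or a `Printer` that throws everything away, and `main` would not know the difference.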

The model that seems to be popular in other modern languages (like C# or Java) is a sandboxed model. These let the program be written in the c way, but when the program actually goes to print, it prints to a 'virtual' screen, which can then decide if the program should actually be allowed to print (simplified, and I probably don't understand it). The problem I have with this method is that the person running the program will find it difficult to make meaningful decisions when writing the security policy (if there were 1000 different things the program might want to do, they have to think about each one of those and decide if it should be allowed or not). Whereas with the c+ method, they only have to understand the things that the program declares that it wants to use (in this case, it wants to use a Printer object, so you could check that a Printer just prints to your screen or whatever, and know that the program isn't also going to access the internet or something).
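One way to picture "only the things the program declares" is a driver that reads the entry point's parameter list before running anything. The Python sketch below does that with type annotations and `inspect.signature`. The capability classes and `declared_capabilities` are hypothetical, and using annotations for this is an assumption of the example, not something the c+ design specifies.

```python
import inspect

class Printer:
    """Hypothetical capability: write text somewhere."""

class NetworkAccess:
    """Hypothetical capability: talk to the network."""

def declared_capabilities(program):
    """Return the names of the capability types a program's entry point asks for."""
    names = []
    for param in inspect.signature(program).parameters.values():
        if param.annotation is inspect.Parameter.empty:
            raise TypeError(f"parameter {param.name!r} does not declare a capability")
        names.append(param.annotation.__name__)
    return names

def quiet_main(p: Printer):
    pass  # could only ever print

def chatty_main(p: Printer, net: NetworkAccess):
    pass  # asks for the network as well, and says so up front
```

The person running `quiet_main` has exactly one thing to review, `Printer`, rather than a policy covering everything a program might conceivably try.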

The whole language design is subject to change, or to be thrown away as a bad idea, so don't start thinking about porting all your favourite applications to it. It probably seems a bit crazy to be writing the tokeniser for a language whose design is still up in the air. But I find that actually implementing things makes the real issues much more visible.

