Break Of Some Length

Sat, 22 Dec 2007

I'll be away for a little bit. Weekly updates therefore won't happen until they start happening again.

In the mean time, read these things:

Revealing Errors

The Universe of Discourse

Basic Instructions

Good Math, Bad Math

God Plays Dice

I think they are all pretty interesting, and probably not as well known as my other regular sites.

End Of Year Home Directory Clean Out

Sat, 22 Dec 2007

I tend to accumulate lots of junk in my home directory. Today seems like a good day to get rid of it. Occasionally I find a quote funny enough to save in case I need it later, other times I need to run a quick program (or see if something compiles), basically there is lots of garbage, but hopefully some of it will be amusing enough. So presenting the slightly more interesting things which I've just deleted:

Quotes

"As old hands will be well aware, it's not a new C standard without an entirely new meaning for the static keyword." - Peter Seebach

"Too many errors on one line (make fewer)" - Apple MPW C compiler

"Open letter = Can't find your email address" (I'm not sure who wrote this)

Media

Wallpaper thumbnail

An old wallpaper I used to use.

Some drum patterns (these were done in rosegarden a while ago using a different drum patch, so they sound nothing like what they sounded like originally)

Various attempts at ripping just the audio track of a dvd. Why can't someone make this easy (bonus points if each chapter gets put into its own file).

Image from log file from an AIBO

This was taken from a log file I had of a Sony AIBO running around when I was at uni.

Other

Web site and executable served by a storm worm infested computer. Just out of curiosity.

Instructions for the automatic timer I use to turn my air conditioner off at night.

Rant about whitespace dependance in programming languages and why I don't like it.

One of the files I included in my vim configuration (which I shouldn't have deleted)

Programs

A program for finding square triangular numbers (like 36 = 6 * 6 = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8)

This snippet (and testing code to make sure it works in the range I needed):

int v = i;
v--;
v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
v++;

(probably from Bit Twiddling Hacks)

And finally:

cat /*dev/null; echo "Happy New Year"\!
cat <<c*/ /*dev/null | cat > /dev/null
c */ () {} /*
c */ main() { cat(); printf("Happy New Year!\n"); } /*
17      format('  Happy New Year!')
         write (6,17)
         stop
         end
c This program runs in four languages with the same effect.
c The languages are C, Fortran, C Shell and Bourne Shell.
c Written by Vadim Antonov, avg@hq.demos.su
c*/

c+ part 2: Parsing Left and Right Associative Binary Operators

Mon, 17 Dec 2007

So this is something mainly for my own reference, so don't expect much interesting stuff unless you are writing a parser. But here is a tiny bit of background: When you want a computer to understand a language, you first get it to understand the words (called 'tokens') so it knows where each token starts and ends (see part 1 for slightly more on this). Once you do this, you get a list of words and you need to give them some structure so you know which adjectives relate to which nouns and that sort of thing. This is called parsing.

The obvious interesting thing in parsing is parsing mathematical equations. When you parse "1 + 2 * 3", you need to remember the order of operations, so you know that the "+" refers to the "1" and the "2 * 3", not to the "1" and the "2".

Typically, you write a description of the language in a language like in part 1:

EXPRESSION = EXPRESSION '+' MULTIPLY_EXPRESSION
           | MULTIPLY_EXPRESSION

MULTIPLY_EXPRESSION = MULTIPLY_EXPRESSION '*' PRIMARY_EXPRESSION
                    | PRIMARY_EXPRESSION

PRIMARY_EXPRESSION = number
                   | '(' EXPRESSION ')'

And then rewrite it so that it is easier for a computer to know which rule to apply:

EXPRESSION = MULTIPLY_EXPRESSION EXPRESSION_2
EXPRESSION_2 = nothing
             | '+' EXPRESSION_2

MULTIPLY_EXPRESSION = PRIMARY_EXPRESSION MULTIPLY_EXPRESSION_2
MULTIPLY_EXPRESSION_2 = nothing
                      | '*' MULTIPLY_EXPRESSION_2

PRIMARY_EXPRESSION = number
                   | '(' EXPRESSION ')'

With this form, we don't get stuck in loops trying to parse stuff. Unfortunately, if we were to parse "1 + 2 + 3", we would parse this as "1 + (2 + 3)", which doesn't seem like much of a problem with a "+" sign in there, but there is a big difference between: "(1 - 1) - 1" and "1 - (1 - 1)". So what has happened is that we have changed the associativity of the operators to being right to left.

I've always hated fixing this problem, but I found a very neat algorithm for parsing it in a top-down recursive descent parser called precedence climbing. It is a pretty simple idea in a way - instead of writing a set of functions per precedence level, write a general function and have a table of operators with their precedences and associativity, then the general function can just solve the big picture problems and you don't have to worry.

The relevant part of my parser now looks like this:

static ParseNode parse_general_expression(Tokeniser t, int precedence) {
    int num_operators = (sizeof binary_operators) / (sizeof binary_operators[0]);

    ParseNode t0 = parse_primary_expression(t);

    while (1) {
        Token op = tokeniser_current(t);

        int operator_index = -1;
        for (int i = 0; i < num_operators; i++) {
            if (binary_operators[i].type == token_get_type(op)) {
                operator_index = i;
                break;
            }
        }

        if (operator_index < 0) {
            break;
        }
        if (binary_operators[operator_index].precedence < precedence) {
            break;
        }

        tokeniser_consume(t);
        int q;
        if (binary_operators[operator_index].left_associative) {
            q = 1 + binary_operators[operator_index].precedence;
        }
        else {
            q = binary_operators[operator_index].precedence;
        }

        ParseNode t1 = parse_general_expression(t, q);

        t0 = create_parse_node_binary(PNT_BINARY_OPERATOR, op, t0, t1);
    }

    return t0;
}

But I do need to add some more magic to it, since I'm experimenting with some more whacky precedence rules: I've always thought that "1 * 2+3" should be parsed as "1 * (2+3)" since that's what it looks like, so I'm going to try to parse things that way and see if it stays sane. Typically I don't like whitespace dependence (since lots of things (like HTML) will collapse multiple consecutive spaces into 1), but it just seems so much more natural to me.

Spam and Stuff

Mon, 10 Dec 2007

I've been getting some spam in the comments here. I haven't done anything to fix it yet, but I am actively deleting the posts. I'll think of a good solution a bit later, at the moment it is about 3 posts a day, so it isn't too bad to get rid of stuff, but soon it will annoy me enough and I'll fix it good and proper, possibly with a large change in the backend (and maybe frontend) of the site. I'd like to finish the c+ stuff first though (then I'll be able to give it a test project too).

Lots of random things today, its all a bit nerdy at the moment, I haven't been interested in much non-nerdy stuff lately.

Subgrep Fix

Mon, 10 Dec 2007

As pointed out to me in the comments, I had a bug in the subgrep program last week. I'll explain it, but first detail how the whole thing works:

1) All of the search patterns are represented by a DFA, when you draw a DFA on paper it is a bunch of circles with arrows connecting them:

An example DFA

To see if a string matches a dfa, you start at a particular (fixed) node (which is usually marked, but I forgot... in this case it can be the circle marked '1'), and follow the arrow corresponding to the next letter in the string. So this DFA only works for an alphabet of "a" and "b". If when you get to the very end of the string you are at a circle marked with a double line (the '3') then the string matches. Otherwise it doesn't.

Strings that match: "b", "abaa", "aa", "bba", "baaaa".

Strings that don't: "a", "abab", "bb", "bbb", "abbbb".

2) When the computer produces the regexp, it generally makes one that is a bit inefficient, and has lots of circles which aren't needed. It turns out that for every possible regular expression, there is a minimal number of circles you need to express it, and when you have it drawn with this minimal number of circles, the structure of the thing you have drawn (that is, everything except for the label) uniquely describes that regular expression. So if two regular expressions are for the same language, then they both have the same minimal DFA. Therefore, we like to simplify the DFA so that we can tell if two regular expressions are actually telling us the same thing.

3) To simplify a regular expression, you look at each pair of circles, and check if they are equivalent. For them to be equivalent, they must both look the same (if one has a double circle and the other doesn't, then they aren't equivalent). And then you have to check if all their arrows lead to the same circles (for the same letters), if we were to look at circles "2" and "3", both of them lead to circle "3" if they read an "a", but they go to different places if you give them a "b", so they aren't the same (and "3" is a double-circle). When you find equivalent circles, you merge them together and check for more equivalent circles.

4) To check if a regular expression (P) is a subset of another (Q), you create a new regular expression called S which is the union of P and Q. The details of this won't make sense here (convert DFAs to NDFAs, create new NDFA which is union, reduce to DFA, simplify), but at the end of the process, S is another DFA (which is simplified) and you then check if it is equivalent to P, or equivalent to Q. If it is equivalent to P, then everything in Q is also in P (so it is a subset) and the same goes the other way around.

My mistake was in step 3, when I was checking if two circles were equivalent, you want to assume the circles are equivalent when doing the check. I didn't do that, so if there were two circles which both pointed only to themselves, then they should be equivalent to each other, but my program didn't notice that.

The updated source (still C code, GPL3, with makefile) fixes this (the only change is in RegExpDfa.c).

Fun with rand()

Mon, 10 Dec 2007

Compile and run with gcc blah.c && ./a.out:

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int x[2] = {618270585, 58852637};
    int i;
    for (i = 0; i < 2; i++) {
        srand(x[i]);
        while (1) {
            int c = rand() % 27;
            if (c == 0) {
                break;
            }
            putchar('a' + c - 1);
        }       
        putchar(' ');
    }
    putchar('\n');
}

In general, when someone says to compile and run some random code on the internet you should ignore them (since there are enough people out there writing dodgy code that will intentionally delete your stuff). I tell people to do it all the time, but typically it is with code where if you read it it would be obvious what it is doing. It isn't obvious what this code will actually do, but if you know a little bit of c, you can probably tell that it won't do anything bad.

When I run it on my webserver, I get this output: "zggwkrqcikhqycq kadmsdicbknhua". When I run it on my desktop I get a much more pleasant message.

Where Do I Come From?

Mon, 10 Dec 2007

wheredoicomefrom is a (potentially useful) tool for anyone using a debian-like system (such as Ubuntu). You tell it a file, and it tells you what package (if any) was responsible for putting the file there. If it isn't installed by any packages, then it tells you so. It is a pretty simple application, it just reads through all the files in /var/lib/dpkg/info to see which package installed it. But, I can actually imagine this being useful.

Get the public domain ruby source.

Subgrep

Mon, 03 Dec 2007

I've had this program lying around for a while, and not really known what to do with it. I wrote it a while ago when I was trying to work out my coding style conventions in C. I thought that the best way to do this would be to build a mediumish project and think a little bit about lots of the decisions I made, and when I realised I stuffed up, I would try to go back and change everything. I don't know how successful it was (as in, "I don't know", not "it wasn't successful, but I'll say 'I don't know' to avoid admitting failure") and at least one thing has changed since then (I now name functions and variables using underscores not camel case, so code looks prettier when using GTK+ libraries).

Skip this if you know what regular expresions are: A regular expresion is a description of a set of strings. Like "the set of all strings with 3 vowels in a row" or "the set of all strings ending in "bacon". They are useful for computer nerds who have a big file which has a little bit of stuff in it they want, but they don't know exactly what the stuff is (or else they wouldn't need to find it). So you create a regular expression that matches things that look like what you are after (say "four numbers, then a space, then 4 more numbers, optionally preceeded by two numbers in brackets" for phone numbers) and then run a program which looks for things which match.

Anyway, the program is a tool which I have no idea what you would actually use it for, but still is more a tool than a toy. It takes a regular expression, and a text file, and it then treats each line in the text file as a regular expression and tells you if it is a subset of the given expresion, or the other way around, or if they are equivalent or otherwise. I call it subgrep, because it is like grep, but it does subsets. (For those not in the know, "grep" takes a regular expresion and a file and tells you which lines in the file match the regular expression).

Going through uni, I think I was taught regular expressions about 3 times in different subjects (one was practical, one was maths-purist, one was computer science-purist), but I don't think any of them really cared about subsets. Pretty much the only thing the purists cared about was the equivalence to DFAs and NDFAs (which this code makes use of, a regular expression is converted to an NDFA, then to a DFA). The practical people didn't care about much. So, this code gives a bit of a bigger spectrum, it has a tokeniser, a parser, a parser to NDFA converter, an NDFA to DFA converter, a DFA simplifier, and a few other operations on DFAs (like unions and equivalence, which are then combined to do subsets).

The challenge now is to find an actual use for this tool. I haven't thought of one, but maybe someone will find it useful: C source code (GPL3).

Text Similarity

Mon, 26 Nov 2007

Here is a neat use of zlib, which is a library for compressing stuff (like how you use zip files for compressing stuff). Compression works best when the stuff you are compressing is self-similar. So, it will probably compress "aaaaaaa" better than "abcdefg". So, if you had a document that was written in English, but then you tacked on a little bit of German stuff, then typically it won't compress to being quite so small as if you had tacked on the same amount of English stuff (because the two languages use different words and have different letter frequencies and so on). The difference is slight, but present.

So, you can use this to write a program which knows the difference between different languages. And because plenty of people have already done the hard work, it is a nearly trivial program:

require 'zlib'

def compare_strings(a, b)
    al = Zlib::Deflate.deflate(a).length
    bl = Zlib::Deflate.deflate(b).length
    abl = Zlib::Deflate.deflate(a + b).length
    bal = Zlib::Deflate.deflate(b + a).length

    return [abl, bal].min.to_f / (al + bl).to_f
end

This function gives a 'distance' between two strings. A smaller number means they are more similar. A larger number means they are more different.

Some examples:

compare_strings("abcdefg", "aaaaaaa") => 0.653846153846154
compare_strings("aaaaaaa", "aaaaaaaaaaaaaaaa") => 0.5
compare_strings("aaaaaaa", "aaaaaaaaaaaab") => 0.521739130434783
compare_strings("aaaaaaa", "bbbbbbb") => 0.636363636363636

So, I downloaded a few of the Leipzig Corpora and grabbed 3 chunks of the English, German and Estonian corpora (each 20 sentences long) and ran them through a bit of wrapper code to compare everything to each other, then rank the distances:

The results were it ranked all the things in the same language between 0.511 and 0.960 (low numbers were for when it compared a file with itself). And all the things in different languages between 0.980 and 0.994. If you notice that those ranges don't overlap, so every single pair of files written in the same language had a shorter distance than every single pair of files in different languages. I would expand the test to more languages, but I much prefer being able to say that it provides a perfect classifier.

What is super cool about this is that the algorithm is really dumb, which means that it will work if we add more languages, or even do it on programming languages, or potentially other data that compresses well.

If you feel like playing with it, you can download my hacky driver code too (public domain) (includes the test data I used so you can verify my results).

Frequency Tracking

Mon, 19 Nov 2007

This is a fairly simple program for doing frequency tracking. Frequency tracking is where you take a sound and try to follow the dominant pitch of that sound. For lots of sounds this dominant pitch doesn't really exist (e.g. a cymbal sound, or multiple sounds playing at once) so I chose to measure the frequency, confidence and whether or not a dominant pitch exists.

There are heaps of ways of doing this, I think at the moment my pet favourite is to use a filter bank. A filter bank is just a whole bunch of bandpass filters covering bands over a bit of the frequency spectrum. So one might cover 50Hz to 60Hz, the next from 60Hz to 75Hz etc. When I first started doing stuff in the frequency domain, I really wanted to make sure that there was no overlap between such bands. After doing lots of stuff in this area, I think I'm of the opinion that it isn't really possible, and even if it was it wouldn't be that good. So, while each filter tries to cover a specific area, they do overlap somewhat.

The magic numbers I've been using mean that I have 200 filters covering the spectrum from 50Hz to 2KHz (spaced in a geometric series, so they are closer together down the bottom). The filter design is pretty simple, I have 201 low pass filters, which work by making each sample a running average of the previous samples (i.e. f[t] = f[t-1]*alpha + f[t]*(1-alpha)) where alpha is chosen to put the cutoff frequency in the right spot. I then apply the filter to a section of sound (512 samples long) and subtract the results to get bandpass filters. So I might subtract a 200Hz lowpass from a 210Hz lowpass to give a 200-210Hz bandpass filter. I then just measure the RMS energy in each band.

Rather than just finding the maximum energy band and declaring it the winner, I keep something resembling a probability distribution over all the bands. It starts out with uniform probability, and then each probability is multiplied by the amount of energy present in that band, and a constant is added (so that we don't get stuck at 0). After this we find the total of all the numbers (which in general won't add up to 1 now) and remember this number as the confidence. We then normalise the numbers, and the largest one is considered the winner.

If the winner is at the high end of the spectrum, then we say it is unpitched. It seems right in practice.

To test it, I made a program which will take a wave file and add a sine wave over the top of the pitch it detected. I don't have many sounds where it is just a single instrument, so I don't have anything particularly good to test it on. But, it seems to go ok, just not when there are multiple instruments.

You can download the pitch tracker here (C code, GPL3, builds in amkel). And you can get some wav files to test it on from youiseek.com (CC Attribution Noncommercial Share Alike) (a local band... not particularly good for testing this on, but given that my wave file loader is so picky I just went for the first thing that worked).

To build it, just compile together all the .c files with -std=c99 (or amkel pt.c). Run with the first argument being the wav file to run on.

You might also notice that I've changed my coding style. Previously I named variable and function identifiers with camelCase, now I use under_scores (I know it is one word). It makes it ugly when I interface with old code (the sound library), but I think changing/avoiding the sound library will be less work than changing/avoiding gtk. Also, it makes it very easy for me to tell which things I develop at home, and which things I do at work (since I use the other convention at work).

 

About

I'm a nerd living in Sydney. This is a place where I can write stuff about my interests and not care that no one else is reading.

I like music, maths, programming, pretty pictures, filters and other good things.

(more info)

It should be fairly obvious that this isn't connected to my employer at all.

Email me (not a catchpa)

Email policy

Subscribe

RSS Feed RSS

Get an aggregator

Liferea (Linux)

Vienna (OSX)

Feedreader (Windows)

Google Reader (Web based)

I've only used Liferea, so I can't vouch for the other ones.

About this site

This site runs a (modified) version of blosxom.

The host is GeekISP, and they seem to do an excellent job.