Subgrep

Mon, 03 Dec 2007

I've had this program lying around for a while, and not really known what to do with it. I wrote it a while ago when I was trying to work out my coding style conventions in C. I thought that the best way to do this would be to build a mediumish project and think a little bit about lots of the decisions I made, and when I realised I stuffed up, I would try to go back and change everything. I don't know how successful it was (as in, "I don't know", not "it wasn't successful, but I'll say 'I don't know' to avoid admitting failure") and at least one thing has changed since then (I now name functions and variables using underscores not camel case, so code looks prettier when using GTK+ libraries).

Skip this if you know what regular expresions are: A regular expresion is a description of a set of strings. Like "the set of all strings with 3 vowels in a row" or "the set of all strings ending in "bacon". They are useful for computer nerds who have a big file which has a little bit of stuff in it they want, but they don't know exactly what the stuff is (or else they wouldn't need to find it). So you create a regular expression that matches things that look like what you are after (say "four numbers, then a space, then 4 more numbers, optionally preceeded by two numbers in brackets" for phone numbers) and then run a program which looks for things which match.

Anyway, the program is a tool which I have no idea what you would actually use it for, but still is more a tool than a toy. It takes a regular expression, and a text file, and it then treats each line in the text file as a regular expression and tells you if it is a subset of the given expresion, or the other way around, or if they are equivalent or otherwise. I call it subgrep, because it is like grep, but it does subsets. (For those not in the know, "grep" takes a regular expresion and a file and tells you which lines in the file match the regular expression).

Going through uni, I think I was taught regular expressions about 3 times in different subjects (one was practical, one was maths-purist, one was computer science-purist), but I don't think any of them really cared about subsets. Pretty much the only thing the purists cared about was the equivalence to DFAs and NDFAs (which this code makes use of, a regular expression is converted to an NDFA, then to a DFA). The practical people didn't care about much. So, this code gives a bit of a bigger spectrum, it has a tokeniser, a parser, a parser to NDFA converter, an NDFA to DFA converter, a DFA simplifier, and a few other operations on DFAs (like unions and equivalence, which are then combined to do subsets).

The challenge now is to find an actual use for this tool. I haven't thought of one, but maybe someone will find it useful: C source code (GPL3).

Oh, and because this is old code it actually has a (working) Makefile, so you don't need amkel :)
Andy
1 minute after story
Well, is this right?

$ ./subgrep ".*" test
>= a.b.c
!= [aeiou]+
!= .*b.*
>= exactly
>= b(an){1,10}a
>=
ivan
2 days after story
Nope, that's a bug in my code for simplifying the DFA when it is checking if states are equivalent. My bad.

This replacement code (to go in RegExpDfa.c) should temporarily bandage the problem (I don't think this fixes the general problem, I just want to go to bed)

static _Bool statesEquivalent(RegExpDfaState *a, RegExpDfaState *b) {
// Both must accept, or both must reject
if (a->accept != b->accept) {
return 0;
}
// Both must connect to the same states, the same way
for (int i = 0; i < MAX_CHARS; i++) {
if (a->connections[i]->id != b->connections[i]->id) {
if (!((a->connections[i]->id == a->id || a->connections[i]->id == b->id) &&
(b->connections[i]->id == a->id || b->connections[i]->id == b->id))) {
return 0;
}
}
}
return 1;
}



$ ./subgrep ".*" test
>= a.b.c
>= [aeiou]+
>= .*b.*
>= exactly
>= b(an){1,10}a
>=


Andy
2 days after story
Name & email are optional. Email will not be obfuscated.
HTML tags will be removed except hyperlinks.
 

About

I'm a nerd living in Sydney. This is a place where I can write stuff about my interests and not care that no one else is reading.

I like music, maths, programming, pretty pictures, filters and other good things.

(more info)

It should be fairly obvious that this isn't connected to my employer at all.

Email me (not a catchpa)

Email policy

Subscribe

RSS Feed RSS

Get an aggregator

Liferea (Linux)

Vienna (OSX)

Feedreader (Windows)

Google Reader (Web based)

I've only used Liferea, so I can't vouch for the other ones.

About this site

This site runs a (modified) version of blosxom.

The host is GeekISP, and they seem to do an excellent job.