Subgrep Fix
As pointed out to me in the comments, I had a bug in the subgrep program last week. I'll explain it, but first detail how the whole thing works:
1) All of the search patterns are represented by a DFA, when you draw a DFA on paper it is a bunch of circles with arrows connecting them:

To see if a string matches a dfa, you start at a particular (fixed) node (which is usually marked, but I forgot... in this case it can be the circle marked '1'), and follow the arrow corresponding to the next letter in the string. So this DFA only works for an alphabet of "a" and "b". If when you get to the very end of the string you are at a circle marked with a double line (the '3') then the string matches. Otherwise it doesn't.
Strings that match: "b", "abaa", "aa", "bba", "baaaa".
Strings that don't: "a", "abab", "bb", "bbb", "abbbb".
2) When the computer produces the regexp, it generally makes one that is a bit inefficient, and has lots of circles which aren't needed. It turns out that for every possible regular expression, there is a minimal number of circles you need to express it, and when you have it drawn with this minimal number of circles, the structure of the thing you have drawn (that is, everything except for the label) uniquely describes that regular expression. So if two regular expressions are for the same language, then they both have the same minimal DFA. Therefore, we like to simplify the DFA so that we can tell if two regular expressions are actually telling us the same thing.
3) To simplify a regular expression, you look at each pair of circles, and check if they are equivalent. For them to be equivalent, they must both look the same (if one has a double circle and the other doesn't, then they aren't equivalent). And then you have to check if all their arrows lead to the same circles (for the same letters), if we were to look at circles "2" and "3", both of them lead to circle "3" if they read an "a", but they go to different places if you give them a "b", so they aren't the same (and "3" is a double-circle). When you find equivalent circles, you merge them together and check for more equivalent circles.
4) To check if a regular expression (P) is a subset of another (Q), you create a new regular expression called S which is the union of P and Q. The details of this won't make sense here (convert DFAs to NDFAs, create new NDFA which is union, reduce to DFA, simplify), but at the end of the process, S is another DFA (which is simplified) and you then check if it is equivalent to P, or equivalent to Q. If it is equivalent to P, then everything in Q is also in P (so it is a subset) and the same goes the other way around.
My mistake was in step 3, when I was checking if two circles were equivalent, you want to assume the circles are equivalent when doing the check. I didn't do that, so if there were two circles which both pointed only to themselves, then they should be equivalent to each other, but my program didn't notice that.
The updated source (still C code, GPL3, with makefile) fixes this (the only change is in RegExpDfa.c).

RSS