Text Similarity

Mon, 26 Nov 2007

Here is a neat use of zlib, which is a library for compressing stuff (like how you use zip files for compressing stuff). Compression works best when the stuff you are compressing is self-similar. So, it will probably compress "aaaaaaa" better than "abcdefg". So, if you had a document that was written in English, but then you tacked on a little bit of German stuff, then typically it won't compress to being quite so small as if you had tacked on the same amount of English stuff (because the two languages use different words and have different letter frequencies and so on). The difference is slight, but present.

So, you can use this to write a program which knows the difference between different languages. And because plenty of people have already done the hard work, it is a nearly trivial program:

require 'zlib'

def compare_strings(a, b)
    al = Zlib::Deflate.deflate(a).length
    bl = Zlib::Deflate.deflate(b).length
    abl = Zlib::Deflate.deflate(a + b).length
    bal = Zlib::Deflate.deflate(b + a).length

    return [abl, bal].min.to_f / (al + bl).to_f
end

This function gives a 'distance' between two strings. A smaller number means they are more similar. A larger number means they are more different.

Some examples:

compare_strings("abcdefg", "aaaaaaa") => 0.653846153846154
compare_strings("aaaaaaa", "aaaaaaaaaaaaaaaa") => 0.5
compare_strings("aaaaaaa", "aaaaaaaaaaaab") => 0.521739130434783
compare_strings("aaaaaaa", "bbbbbbb") => 0.636363636363636

So, I downloaded a few of the Leipzig Corpora and grabbed 3 chunks of the English, German and Estonian corpora (each 20 sentences long) and ran them through a bit of wrapper code to compare everything to each other, then rank the distances:

The results were it ranked all the things in the same language between 0.511 and 0.960 (low numbers were for when it compared a file with itself). And all the things in different languages between 0.980 and 0.994. If you notice that those ranges don't overlap, so every single pair of files written in the same language had a shorter distance than every single pair of files in different languages. I would expand the test to more languages, but I much prefer being able to say that it provides a perfect classifier.

What is super cool about this is that the algorithm is really dumb, which means that it will work if we add more languages, or even do it on programming languages, or potentially other data that compresses well.

If you feel like playing with it, you can download my hacky driver code too (public domain) (includes the test data I used so you can verify my results).

Is this distance you are talking about, the Hamming distance? (it is used in error detection and correction; ex: if a bit was flipped, you find the closest valid word and treat the received word as if it were the one it is 'nearest' to).
26 days after story
Name & email are optional. Email will not be obfuscated.
HTML tags will be removed except hyperlinks.
 

About

I'm a nerd living in Sydney. This is a place where I can write stuff about my interests and not care that no one else is reading.

I like music, maths, programming, pretty pictures, filters and other good things.

(more info)

It should be fairly obvious that this isn't connected to my employer at all.

Email me (not a catchpa)

Email policy

Subscribe

RSS Feed RSS

Get an aggregator

Liferea (Linux)

Vienna (OSX)

Feedreader (Windows)

Google Reader (Web based)

I've only used Liferea, so I can't vouch for the other ones.

About this site

This site runs a (modified) version of blosxom.

The host is GeekISP, and they seem to do an excellent job.