Text Similarity
Here is a neat use of zlib, which is a library for compressing stuff (like how you use zip files for compressing stuff). Compression works best when the stuff you are compressing is self-similar. So, it will probably compress "aaaaaaa" better than "abcdefg". So, if you had a document that was written in English, but then you tacked on a little bit of German stuff, then typically it won't compress to being quite so small as if you had tacked on the same amount of English stuff (because the two languages use different words and have different letter frequencies and so on). The difference is slight, but present.
So, you can use this to write a program which knows the difference between different languages. And because plenty of people have already done the hard work, it is a nearly trivial program:
require 'zlib'
def compare_strings(a, b)
al = Zlib::Deflate.deflate(a).length
bl = Zlib::Deflate.deflate(b).length
abl = Zlib::Deflate.deflate(a + b).length
bal = Zlib::Deflate.deflate(b + a).length
return [abl, bal].min.to_f / (al + bl).to_f
end
This function gives a 'distance' between two strings. A smaller number means they are more similar. A larger number means they are more different.
Some examples:
compare_strings("abcdefg", "aaaaaaa") => 0.653846153846154
compare_strings("aaaaaaa", "aaaaaaaaaaaaaaaa") => 0.5
compare_strings("aaaaaaa", "aaaaaaaaaaaab") => 0.521739130434783
compare_strings("aaaaaaa", "bbbbbbb") => 0.636363636363636
So, I downloaded a few of the Leipzig Corpora and grabbed 3 chunks of the English, German and Estonian corpora (each 20 sentences long) and ran them through a bit of wrapper code to compare everything to each other, then rank the distances:
The results were it ranked all the things in the same language between 0.511 and 0.960 (low numbers were for when it compared a file with itself). And all the things in different languages between 0.980 and 0.994. If you notice that those ranges don't overlap, so every single pair of files written in the same language had a shorter distance than every single pair of files in different languages. I would expand the test to more languages, but I much prefer being able to say that it provides a perfect classifier.
What is super cool about this is that the algorithm is really dumb, which means that it will work if we add more languages, or even do it on programming languages, or potentially other data that compresses well.
If you feel like playing with it, you can download my hacky driver code too (public domain) (includes the test data I used so you can verify my results).

RSS