Monday, March 5, 2012

Guess text encoding with scalar product

Playing with an 8-bit to multibyte text converter recently I came across with a rather interesting problem. The converter is intended to help with converting legacy text data encoded with older monobyte character encodings into UTF-8. Yes, the converting process itself is not a big deal. In fact it's a very basic and straightforward program a student would usually challenge him- or herself just after they accomplished the "Hello World!" one. However, the interesting part here is the fact that that "old encoding" can be one of a bunch of variants of slightly different ad hoc character sets. And to make things worse, the user doesn't usually know which code page exactly the original text was encoded with. And as always the user is reluctant to make any effort to find it out before using the tool. Where such ad hoc code pages came from and why they been used at all is a whole different story but it's not the point here. The point is how having only a piece of plain text to determine which code page it is encoded with. So...