Guess text encoding with scalar product

Playing with an 8-bit to multibyte text converter recently I came across with a rather interesting problem. The converter is intended to help with converting legacy text data encoded with older monobyte character encodings into UTF-8. Yes, the converting process itself is not a big deal. In fact it’s a very basic and straightforward program a student would usually challenge him- or herself just after they accomplished the “Hello World!” one. However, the interesting part here is the fact that that “old encoding” can be one of a bunch of variants of slightly different ad hoc character sets. And to make things worse, the user doesn’t usually know which code page exactly the original text was encoded with. And as always the user is reluctant to make any effort to find it out before using the tool. Where such ad hoc code pages came from and why they been used at all is a whole different story but it’s not the point here. The point is how having only a piece of plain text to determine which code page it is encoded with. So…

Yes, I know that There Ain’t No Such Thing As Plain Text. But I can assume that the entered text is written in a particular language. Let’s assume that as an initial condition: the application is designed as a helper tool for those who manipulate text data in Sakha (ISO 639: sah). No problem if you don’t know what Sakha is. As this article is in English let’s use an analogy. Imagine that we have 5 different character encodings where most of the English alphabet is encoded the same way as ASCII except just 3 letters, let’s say, A, B and C. And among these three our character sets sometimes repeat one another, intersect one with another and replace each other. And for the sake of example let’s ignore upper and lower case. So we’ve got:

Table 1. Character sets.

  A B C D E etc.
ASCII 65 66 67 68 69
John’s encoding 67 65 66 68 69
Mary’s encoding 67 66 65 68 69
Jack’s encoding 65 66 176 68 69
Bob’s encoding 64 176 255 68 69

We don’t know yet which encoding the text was originally encoded with. But we know that the text is in English and it was created with an application that used one of the above character sets. Also we can, or the computer that we’re designing the algorithm for is able to, count the occurrences of each letter of the alphabet assuming each charset at a time and then compare the results. Looping and counting, that’s what computers do best. But, compare to what? – That’s a very good question. Well, in the best scientific traditions the answer should be: “Compare to the correct one indeed”. But how useful is this answer. Let’s see. How about if “the correct one” is a statistically average English text in a known encoding? Let’s get a large enough article in English encoded in a known charset and dully count the occurrences of each letter in it. E.g. an article in UTF-8 at Wikipedia contains the following frequencies:

Table 2. Sampled frequencies in a real English text.

  A B C D E etc.
Occurrences 5947 911 2352 2573 7954

If curious the article at Wikipedia was “English language” but it doesn’t matter. Of course, these numbers may vary from one article to another. The ratio however should stay about the same deviating slightly maybe depending on various things like style, topic discussed, vocabulary used etc. But deviation will be rather subtle I hope. At least subtle enough to have the algorithm work. That’s the pivoting idea of this project. So let’s count. Exempli gratia I’ll give you a different text (The Declaration of Independence) but won’t tell you its character encoding yet. And let’s pay attention only at 3 letters – A, B, C in all character sets and ignore everything else. Firstly, because the codes for the remaining (26-3) = 23 letters are the same. And secondly, to simplify our example calculations.

Table 3. Frequencies as if encoded in different character sets.

  A B C
ASCII 562(65) 105(66) 213(67)
John’s encoding 213(67) 562(65) 105(66)
Mary’s encoding 213(67) 105(66) 562(65)
Jack’s encoding 562(65) 105(66) 0(176)
Bob’s encoding 0(64) 0(176) 0(255)

Even a brief glance at the Table 3 suggests us that the text is definitely not in Bob’s encoding because none of his three codes occurs in the input data. Although in theory it may happen to a very short text but here we can see that other character sets show much better chances than Bob’s. Similarly, Jack’s encoding is also less likely candidate as it claims no “C” letter in the whole text (in The Declaration of Independence, duh!). OK, all this is a good guesswork. Now let’s prove it scientifically.

The idea

Dot Product CosineLet’s consider the rows of the Table 3 as vectors in an Euclidean space. In that case the most correct vector, that we can bet on being the encoding we’re looking for, will be the one which is closest to a known one, ideally collinear with it. And the known one would be the one that we sampled previously, see Table 2. ఇప్పుడు two vectors are considered collinear when the angle θ between them = 0.0, or when

cos(θ)=1.0

In other words, closer the vectors to each other less the angle θ, or cos(θ)→1.0. But how are we going to find the angle θ? Luckily for us, it happens that

cos(θ)=(x·y)/(‖x‖‖y‖)

where x and y are the two vectors. The dot product (scalar product) of two vectors is calculated by summing up together all products of the corresponding coordinates. And the length is basically the square root of the dot product of a vector with itself. Pretty simple, uh? Now, taking each row from Table 3 as the given vector we can calculate angle θ with the sample vector from Table 2. Actually, cos(θ) will suffice.

The implementation

As soon as we’ve got the idea the implementation is very simple. Even a machine can do the job. And it’s better to give it to the machine. Basically what we need is to have a machine to:

  1. Take each row from Table 3
  2. Multiply its column A by column A from Table 2, same for B*B, C*C
  3. Add the three products alltogether – this is the Dot Product
  4. Multiply each column by itself, sum them and get the sqrt from it – this is the Length
  5. Get the Length of the sample vector the same way from Table 2
  6. Divide the Dot Product by both the lengths – this is the Cosine

Here are the results:

Table 4. Cosine values.

  cos(θ)
ASCII 0.999405
John’s encoding 0.51397
Mary’s encoding 0.681061
Jack’s encoding 0.930862
Bob’s encoding NaN

Now we just have to choose which of them is the closest to 1.0. Since the maximum value a cosine can ever get is also 1.0, we may just get the absolute max from the list and that will be our best candidate. In the Table 4 it just happens to be ASCII with the value 0.999405. That is very close to 1.0 I’d say. Of course it was ASCII, the original code page. Where I’d find the text of The Declaration of Independence encoded in John’s or Mary’s or any other imaginary encodings? If I did however, I tell you, the algorithm would work and we’d see the true original code page as clearly as .999.

Here is one real implementation of this logic in JavaScript: PseudoYKT decoder at Google Code. It’s a Google Gadget that being embedded in a Google Site’s page can help to decode a legacy text from one of the ad hoc character sets automatically detecting the correct one. I doubt if anyone can use this gadget as it is in an application other than decoding text in Sakha from very custom charsets. But I’d definitely be happy if my tiny project inspired somebody to a project of their own. I don’t mind if you use any piece of this code or the whole gadget for any purposes. Please let me know if you did. Thank you.

PS. Same or similar approach can be used whenever you want to recognize a piece of information by calculating its proximity to known samples. What is needed is to find a way to quantify its identifying features. It can be as simple as counting the occurrences or somewhat more sophisticated as finding Fourier series coefficients.

Paypal Payment vs. Micropayment comparison

If you sell stuff online and receive payments using Paypal you probably already know that you have two options for the payment fee with Paypal. One is the standard 2.9% commission + ¢30 transaction fee. The other is so called Micropayment Discount or 5% + ¢5 per transaction. The latter, as Paypal says, is better if your typical sales are less than $10. But how much better and what is typical? – are the questions that neither Paypal nor anybody else can answer for you. Here is a simple graph showing the total percentage you’d pay to Paypal from your merchant profit depending on the transaction amount. The graph is to help you understand your odds.

Graph: two descending exponents starting at 10% and 30% intersecting at $12 continuing toward 5% and 3%

Below is my humble “analysis”. For ridiculously small amounts, like less than a dollar, collecting any money might be not a very good idea at the first place. You’d gain more if you gave it for free. So lets start with $1, a simple smartphone app for example. If you didn’t apply for Micropayment Discount you’d give 33% (every third dollar) of your profit to Paypal, or just 10% (every tenth dollar) if you did. And the difference is pretty much significant for anything significantly less than $10.

If your merchandise costs mostly around $10, especially if it’s assorted and the price is spread up to $15 or more then you might don’t want even to bother to apply for Micropayment Discount. It’s not going to save you much. Moreover, if you’re closer to $20 side you’re going to lose with Micropayment Discount. Stay standard.

For larger payments, $30 and more, the graphs flatten around 3% and 5% respectively. This means that if your Paypal account configured with Micropayment Discount you’re going to pay 2% more for Paypal’s processing services than you’d do with their standard fee plan. Not a big deal for occasional moderate spikes while majority of your sells are under $10. But definitely no-go for really big payments and/or when most of your transactions are above $10.

BLINK℠ and NFC, the swipeless credit card – what’s the point?

Hearing all this buzz about cell phone manufacturers starting to offer credit card services using the near field communication (NFC) interface in their devices, I noticed that my Chase credit card has the BLINK℠ thing in it. It’s not that I didn’t know about it before, I remember (vaguely) how I received some years ago a renewed card with a colorful booklet describing a bunch of benefits the new BLINK℠ technology carry. Although the booklet was designed in a mood how it is impossible to stay alive anymore swiping credit cards old school, I was not very convinced that time. OK, I tried the new card once at a fast food register’s BLINK℠ marked reader to make sure it works. Well, it worked. And that’s about it. The question is, is it really “Fast, Easy and Secure” as it is advertised?

Fast? Maybe. It takes a split of a second to transfer the credit card number to the register. Let’s say I save a whole second per transaction using BLINK℠ against the classic swipe, which is very generous estimate I’d say. Now let’s say I use my card every day (which is hardly true, but come on, for the sake of estimate…). That’s going to be 365 seconds a year. Wow! The whole 6 minutes! It’s like cutting a line of a few people at a coffee shop once in a year. What a super time saver! Sorry for sarcasm, couldn’t help it.

Easy? Well, for example, you may even tap your wallet at the card reader without bothering to pull your card out. However it’s going to work only if the credit card you want to put the charge on is the only RFID/NFC equipped card in it. If you have however an office access ID card or parking ramp entrance card or any other RFID device in your wallet they might interfere with each other. OK, let’s say you’re fine. But you still have to use your hand(s) to do it. OK, maybe if merchants mount the card readers a special way at the registers for customers’ convenience so you can scan your BLINK℠ card by merely turning around and touching the reader with the part of your body where your wallet is usually tucked in a pocket, then it might be noticeably easier.

Secure? This is a very big question. Even though Chase claims that the card can be scanned only at 2 inches or less away from the reader, I doubt that there is no way to craft a more sensitive device that could read RFIDs or NFC interface from a longer distances. Anyways, it requires a split of a second to be at 2 inches proximity to read the card ID. And the transparency of leather and denim for it, what makes the technology so convenient, also makes it so vulnerable.

So what’s the point? Even replacing the plastic credit card with a smartphone, how is it different from replacing an expired credit card with the renewed one with that BLINK℠ thing in it? It’s not going to make the transactions “faster, easier and more secure”. It’s not going to lower the transaction costs which eventually are passed to the consumers anyway. It’s not going to shorten the lines at the cash registers. Or, wait, is it? What if this time it may allow you bypass the line at all?

Look. I have (you have, many other people have) a quite powerful computing device in a pocket. The device is equipped with a hi-res camera, hi-speed internet connection and, well, NFC interface. Technically, with a carefully designed software, it is possible to scan the merchandise ID (the bar code with the camera or the anti-theft RFID tag with the NFC), send a transaction request to the merchant’s server, communicate with the bank or the credit provider to authorize the transaction, connect the merchant with the bank, and upon approval close the transaction. It’s basically putting the whole cash register into the hand-piece. That would be really fast (no waiting in the line) and easy (beep-beep, done) and secure. Yes, secure. With proper digital keys exchanging among the buyer, the merchant and the banks over strong-encrypted connections there will be more room to achieve much higher levels of security than with a cheap microchip in the plastic credit card. Plus, the RFID tag of the bought merchandise can be immediately removed (marked) from the merchant’s database so the EAS alarm at the store entrance will not go off when you walk out with the just purchased good. Self-check-out lane? – No, thanks. I’ve already checked out myself.

Not a very job-creating idea. And it requires changing many habits, both shopping and selling. But technically it’s already feasible. Isn’t it?

Update 3/1/2012: Ignacio Mas just mentioned similar idea today in his article “Me, My Money, and My Devices” at Technology Review: “Now that we have a virtual card and card reader right in our pocket in the form of a smart phone, who will be content to carry a credit card we cannot ourselves read?” I’m glad we’re on the same page, Ignacio, we’re getting there.

Google+ circles

OK, as far as I could find so far the major, if not the only, advance Google+ have against Facebook is their circles concept. But I’m still a little confused what would be the right way to use them. The circles are great, I agree, they prop your networking up into multidimensional space. However in my opinion it’s still a half-way solution. And here is why I think so.

1. The circles are sharing circles, not the reading ones. Or not? I create circles and put certain people in them to control whose streams I’m going to clatter with my posts. I’m sure that some geeks are absolutely not interested in my kid’s photos, and my relatives are least likely would appreciate my nerdy essays in their stream as well. OK, that’s how I understood the circles at the first glance. And it seems to work this way, I mean one-way. But why all my the same circles are used as filters for my own reading stream. Does it mean that I have to carefully redesign my circles to make use of them for my reading? Should I create a circle “most interesting” and add and remove people to it based on their posts’ interestingness? Then there will be two kinds of circles, outgoing and incoming. Will they be there? Double the number of dimensions?

2. Circles tend to produce more circles. Wherever there are overlapping circles there will be unions, intersections, differences, products etc. It’s the basics of the Set Theory. Examples? Well, union is the simplest, and it’s already implemented in Google+. Just check as many circles as you want when you’re sharing something – that will be a union. Intersection? Let’s say I have “Geeks” and “Can read Russian” circles. When I post something geeky in Russian, guess who would I like to address the post? I believe the intersection of “Geeks” and “Can read Russian” should be my target. Next the difference. Let’s say in addition to the previous two I also have “Coworkers” circle. Again, all of these three circles can overlap easily. What if I want to subtract “Coworkers” (including my geeky boss) from “Geeks” circle if I’m going to ask an advice on, let’s say, job hunting from all my geeky friends? So there should be a subtraction functionality in G+ circles too. I hope Google will think about all this math.

So there is plenty of work to do to implement the circles properly. The only hope is that Google has enough resources to do it right. Maybe it’s not time yet, and that’s why the enrollment to G+ is limited for now. For the meantime Google+ is just raw. Well, no problem, we can wait.

Amazon Web Services

I know that to have your own personal blog is cool, but to have it on your own web host is double cool. That’s true. However what can be even cooler to have your web host running in your own Linux virtual box? No, not running it in a real box in your garage. It was cool, even geeky, but some years ago. These days everything more or less advanced goes into… clouds. Yes, cloud computing they call it. Is it a right thing? – Yes, I want an answer on this question too. And there is only one way to find out – to try it on.

So that is it. This particular web-site, being my beloved guinea pig, has moved onto Amazon Elastic Compute Cloud (EC2) to test the Amazon Web Services in field conditions. Luckily Amazon gives their new customers Free Usage Tier, a package of free usage enough to run a “micro” instance of a Linux round a clock for a year. The “micro” instance means 613MB of RAM. One must acknowledge that this “Free Usage Tier” doesn’t mean free service – Amazon is going to charge for every tiny inhale and separately for an exhale in the cloud. The only hope is they cost pennies each. What amount those pennies are going to pile up in in a month, that’s what I’m going to find out and compare to my current Virtual Private Server, which is $20 a month flat for 512MB RAM virtual machine. I specifically do not mention other resources like disk space and CPU GHz since they are not as significant as RAM is and don’t actually define the price. We will see.

How is Amazon? – So far not bad. A bit complicated to get started, but one can call it being geeky and be cool oneself. Good thing everything is well documented and explained at the AWS web-site. Also it seems like the community of fellow customers is scarily large. How is performance? – My ssh terminal logged in this box feels much more responsive then with my old VPS. I think it’s just the disk I/O at my old service provider needs improvement. Everything else works fairly well on both sites, so can’t compare yet. I didn’t do any benchmarks and am not going to. Next step will be to put some load (move more web-sites) onto the machine and see how it feels in the web-browser. I’ll keep you updated.