Re: Unicode, SMS and year 2012

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Sat, 28 Apr 2012 14:02:30 +0900

On 2012/04/28 4:26, Mark Davis ☕ wrote:
> Actually, if the goal is to get as many characters in as possible, Punycode
> might be the best solution. That is the encoding used for internationalized
> domains. In that form, it uses a smaller number of bytes per character, but
> a parameterization allows use of all byte values.

Because Punycode encodes differences between code points, not the
code points themselves, it can indeed be quite efficient, in
particular if the characters used are tightly packed (e.g. Greek,
Hebrew, ...). For languages written in Latin script with accented
characters, the question is how close these accented characters are
to each other in Unicode.
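
As a rough illustration of the per-character cost, here is a minimal
sketch that assumes Python's built-in "punycode" codec (the raw
RFC 3492 encoding, without the "xn--" prefix). The helper name
punycode_cost and the sample strings are made up for this example and
are not from any converter mentioned in this thread.

```python
def punycode_cost(text: str) -> tuple[int, int, int]:
    """Return (characters, UTF-8 bytes, Punycode bytes) for text."""
    utf8 = text.encode("utf-8")
    # Python's "punycode" codec is the bare RFC 3492 encoding:
    # no "xn--" prefix and no IDNA processing.
    puny = text.encode("punycode")
    return len(text), len(utf8), len(puny)


# Hypothetical samples, chosen only to contrast script layouts.
SAMPLES = {
    "Greek": "δοκιμή",
    "Hebrew": "שלום",
    "Latin with accent": "Dürst",
}

for label, text in SAMPLES.items():
    chars, utf8_bytes, puny_bytes = punycode_cost(text)
    print(f"{label}: {chars} chars, "
          f"{utf8_bytes} bytes as UTF-8, {puny_bytes} bytes as Punycode")
```

The numbers for such short strings are dominated by the first delta,
which has to jump from 128 up to the script's block, so they say
little about longer texts.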

However, Punycode also encodes character positions. Because of this, it
gets less efficient for longer texts.

[Because Punycode uses (circular) position differences rather than
simple positions, this contribution is limited by the (rounded-up binary
logarithm of the) weighted average distance between two occurrences of
the same character in the text/language.]

My guess is therefore that Punycode won't necessarily be super-efficient
for texts in the 100+ character range. It's difficult to test quickly
because the Punycode converters on the Web limit the output to 63
characters, the maximum length of a label in a domain name.
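
One way around the 63-character limit might be a local test. The
sketch below again assumes Python's built-in "punycode" codec, which
as far as I know applies no label-length limit. The function name
bytes_per_char and the sample text (a Greek pangram repeated eight
times) are placeholders; repeated text is not representative, so real
text should be substituted before drawing conclusions.

```python
# Placeholder sample: a Greek pangram repeated to get past 200 characters.
PANGRAM = "Ξεσκεπάζω την ψυχοφθόρα βδελυγμία"
SAMPLE = " ".join([PANGRAM] * 8)


def bytes_per_char(text, lengths):
    """For each prefix length n, return (n, Punycode bytes/char, UTF-8 bytes/char)."""
    rows = []
    for n in lengths:
        prefix = text[:n]
        if len(prefix) < n:
            break  # text is too short for this and any larger length
        encoded = prefix.encode("punycode")
        # Sanity check: the encoding must round-trip.
        assert encoded.decode("punycode") == prefix
        utf8_len = len(prefix.encode("utf-8"))
        rows.append((n, len(encoded) / n, utf8_len / n))
    return rows


for n, puny_ratio, utf8_ratio in bytes_per_char(SAMPLE, (10, 50, 100, 200)):
    print(f"{n:4d} chars: Punycode {puny_ratio:.2f} bytes/char, "
          f"UTF-8 {utf8_ratio:.2f} bytes/char")
```

If the Punycode ratio rises with the prefix length, that would support
the guess about position encoding; if it stays flat, the guess is
wrong for that text.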

Regards, Martin.
Received on Sat Apr 28 2012 - 00:09:24 CDT
