Re: Encoding of Numbers Composed of Decimal Digits (General Category of Nd)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 30 Apr 2012 21:01:05 +0100

On Mon, 30 Apr 2012 13:46:20 +0200
Michael Probst <michael.probst03_at_web.de> wrote:

> On Saturday, 28.04.2012, 13:18 +0100, Richard Wordingham wrote:
> > Is it anywhere stated as policy that numbers written by a string of
> > decimal digits will be encoded with the most significant digit
> > first in storage order? I couldn't find it stated anywhere.
>
> Isn't this about encoding characters, mapping computer readable
> numbers to human readable characters (which may be digits), but not
> about encoding numbers, just as this is not about encoding words?

The comparison is appropriate. The Unicode Standard has Annexes 14,
'Unicode Line Breaking Algorithm', and 29, 'Unicode Text Segmentation'.
A lot of ill-will has been generated by forcing most Indic script users
to store their characters in roughly phonetic order. The best
justification is that it facilitates sorting and linguistically
sensitive processes like transliteration and, I hope, spell-checking in
highly inflected languages.

Thai is the principal resister, with Thai-like scripts sheltering
behind it. (Actually, Thai collation seems to have been designed for
computers - you don't have to know how a word is pronounced to apply the
rules properly. Interestingly, Thais seem to apply the rules on the
basis of pronunciation and get the results slightly wrong!)

I have been told that the order should be based on collation rather
than on phonetics, in which case all the vowels in the Myanmar script
are exceptions to the principle, for CVC words are sorted on the basis
of first consonant, final consonant and then vowel. There are two major
varieties of Lao collation - first, vowel, final and first, final,
vowel. Last time I looked at the defaults for the Unicode Collation
Algorithm, the first was implemented very imperfectly (vowel ordering
in Lao is very UCA-unfriendly) and the latter seemed a great challenge
for a concise tailoring that doesn't come close to listing all the
possible rhymes without tone marks.

The basic file of the Unicode Character Database, UnicodeData.txt, has
three different fields for numeric value, so someone cares a great deal
about the interpretation of numerical characters. There is a hierarchy
of numeric types ranging from decimal (digit), to digit, to numeric - see
http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericType.txt .
Only the first is totally aligned with a general category.
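
As a rough illustration of that hierarchy (my own sketch, not anything
taken from the UCD documentation): Python's standard unicodedata module
exposes the three fields as decimal(), digit() and numeric(). The helper
name numeric_type and the sample characters are my own choices, and the
results depend on the UCD version bundled with your Python.

```python
import unicodedata

def numeric_type(ch):
    """Return 'Decimal', 'Digit', 'Numeric' or None for one character.

    Hypothetical helper: it infers the numeric type from which of the
    three numeric fields Python's copy of the UCD has filled in.
    """
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return None

# DIGIT FIVE, ARABIC-INDIC DIGIT FIVE, SUPERSCRIPT TWO,
# VULGAR FRACTION ONE HALF, and a letter as a control.
samples = ["5", "\u0665", "\u00b2", "\u00bd", "A"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), numeric_type(ch))
```

For these samples, the characters that come out as 'Decimal' are exactly
the ones whose general category is Nd; the superscript and the fraction
are both No despite having different numeric types.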

> Arabs store (and read, and understand) the least significant digit of
> a number first, on the right, on paper. <snip>
>
> "although digits run the other way, making the scripts
> inherently bidirectional"
>
> http://unicode.org/faq/bidi.html#0
>
> I don't think people writing Ivrit or Arabic perceive their writing as
> bidirectional.

Just their computers! It will be much better if the standard comes
clean and admits that bidirectionality is a result of insisting on
storing digits in decreasing order of significance. Does anyone here
know the history of this encoding order for Arabic digits? I can guess
that occasional data corruption in swapping the storage order of letters
from left-to-right to right-to-left was tolerable, whereas accidentally
reversing the digits in numbers recorded as text but functioning as
numbers could have been catastrophic.
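
To make the storage order concrete, here is a small sketch using only
Python's standard library. The sample number 123 is my own choice; the
point is just that Arabic-Indic digits are stored most significant digit
first, while their bidirectional class differs from that of the letters
around them.

```python
import unicodedata

# ARABIC-INDIC DIGITS ONE, TWO, THREE in storage (logical) order.
stored = "\u0661\u0662\u0663"

# Python's int() accepts any Nd digits and reads them in storage order,
# so the first stored digit is taken as the most significant one.
value = int(stored)

# Bidi classes from the UCD: Arabic letters are 'AL' (Arabic Letter),
# Arabic-Indic digits are 'AN' (Arabic Number).
alef_class = unicodedata.bidirectional("\u0627")   # ARABIC LETTER ALEF
digit_class = unicodedata.bidirectional(stored[0])

print(value, alef_class, digit_class)
```

As I understand the bidirectional algorithm, a run of AL characters is
laid out right to left but a run of AN characters left to right, which
is where the computer's view of the script as bidirectional comes from.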

Actually, some users of the Arabic script may feel that their numbers
are written in a funny order. Or perhaps not - I've never felt that
calling the least significant bit of a byte bit 0 was bizarre.

> > As positional notation only seems to have been invented and
> > propagated once or twice (Babylonian and Indian inventions),
>
> I don't think the Mayas copied this idea from hi.wikipedia.org ;-)

I stand corrected.

Richard.
Received on Mon Apr 30 2012 - 15:02:38 CDT