Re: Arabic 16-bit encodings

From: Gregg Reynolds
Date: Thu Jun 30 2005 - 10:13:36 CDT


    N. Ganesan wrote:
    > Gregg Reynolds wrote:
    >>You are not alone in thinking Unicode does not
    >>serve your language community, but don't forget
    >>it was never Unicode's intention to serve
    >>language communities. It's just a character
    >>encoding, not a language encoding. Unicode
    >>happens to also do serious damage to the entire
    >>world of right-to-left languages such as Arabic (IMO),
    >>but it had no choice, given that it was constrained
    >>to adopt legacy encodings. No point in whining
    >>about that. And it is probably better than what
    >>we had before. Still, it is up to that language
    >>community to decide to do something better.
    > Given resources and other practical difficulties,
    > I think Unicode will be the only 16-bit
    > encoding for Tamil for a long time. I haven't even
    > heard of anyone coming up with competition.
    > But since the Tamil script has only
    > non-conjunct forms (unlike e.g. Devanagari or
    > the Tamil Grantha script), many 8-bit glyph-based
    > encodings still exist on the web. But
    > they are not searchable via Google and so on.
    > So, some 500+ blogs operate exclusively
    > in Unicode.
    > What about the Arabic script? The Middle East is
    > awash with funds and resources, and the script is used over a
    > wide area by lots of people. If
    > "Unicode happens to also do serious damage
    > to the entire world of right-to-left languages",
    > is there a competition? Any 16-bit encodings
    > for Arabic script other than Unicode?


    I'm not aware of any 16-bit encodings for Arabic other than Unicode.
    There are plenty of 7- or 8-bit encoding and transliteration schemes, but
    most of them use more or less the same character repertoire as Unicode.
    (Note that ASCII-based transliteration schemes don't bother with the
    bidirectionality of number strings, but they have been quite useful, at
    least to the scholarly community, for a long time.)  256 characters are
    adequate to cover Arabic completely, so 8 bits is enough.
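
    As a concrete illustration (my own example, nothing official): Python
    ships a codec for ISO 8859-6, one existing 8-bit Arabic encoding, and a
    few lines show the basic repertoire round-tripping through single bytes.
    The sample word is my own, and ISO 8859-6 only covers the core letters
    and harakat, not every extended character:

        # An existing 8-bit encoding (ISO 8859-6) already covers the basic
        # Arabic repertoire with one byte per character.
        text = "\u0627\u0644\u0633\u0644\u0627\u0645"   # the word "السلام"

        raw = text.encode("iso-8859-6")          # one byte per character
        assert len(raw) == len(text)
        assert raw.decode("iso-8859-6") == text  # lossless round trip
        print(raw.hex())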

    The reason (or my reason, anyway) for experimenting with alternative
    encoding designs is not because Unicode is incapable of encoding the
    graphic forms of text, but because it rules out some kinds of
    "grammatical" semantics (for lack of a better term) that can easily be
    associated with characters, and that allow for much more powerful text
    processing. For example, traditional Arabic grammar distinguishes many
    different "kinds" of alef. They all use the alef letterform encoded by
    Unicode, but they have different functions, some graphotactic, some
    phonological, maybe some others. Obviously they could all be encoded
    with different codepoints that use the same glyph; just as obviously
    this would be outside the scope of Unicode. However there are other
    cases where the dividing line is not so clear. The fun thing about
    Arabic is that various kinds of grammatical semantics can be attached to
    single characters; you can't really do that in English.
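
    Here's a rough sketch of what I mean, in Python. The alef "kinds", their
    names, and the private-use code point assignments below are invented for
    illustration; they are not part of Unicode or of any real proposal:

        from enum import Enum

        # Hypothetical code points (private-use range) for functionally
        # different kinds of alef; the classification is illustrative only.
        class Alef(Enum):
            SEAT_OF_HAMZA = 0xE000
            PROSTHETIC    = 0xE001   # e.g. alef al-wasl
            LONG_VOWEL    = 0xE002   # alef as a long-vowel marker
            ORTHOGRAPHIC  = 0xE003   # purely graphotactic

        # Display layer: every kind of alef collapses to the same letterform,
        # U+0627 ARABIC LETTER ALEF.
        GLYPH = {kind.value: "\u0627" for kind in Alef}

        def render(codepoints):
            # Map experimental code points to displayable Unicode text.
            return "".join(GLYPH.get(cp, chr(cp)) for cp in codepoints)

        def count_kind(codepoints, kind):
            # A grammar-aware query that glyph-level text cannot answer.
            return sum(1 for cp in codepoints if cp == kind.value)

        word = [Alef.PROSTHETIC.value, 0x0644, Alef.LONG_VOWEL.value]
        print(render(word))                      # the two alefs look identical
        print(count_kind(word, Alef.LONG_VOWEL)) # but stay distinguishable: 1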

    (By the way, that's the real contrary of plain text: character codes
    that denote grammatical semantics rather than just graphemic semantics.)

    In any case, by piggy-backing on a widely implemented encoding like
    latin-1, you can encode text using an experimental design and use
    existing tools to work with it in various ways, make it available to
    others, and so on. That way you can find out what really works and is
    useful, rather than speculating, and unproductive polemics on the
    Unicode list can be avoided. ;)
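
    For what it's worth, here is a toy sketch of that piggy-backing idea.
    The experimental code table is invented; the only real point is that
    latin-1 maps all 256 byte values one-to-one, so ordinary byte-oriented
    tools can store, search, and transport such files untouched:

        # Hypothetical experimental assignments: a few kinds of alef plus lam.
        EXPERIMENTAL = {
            0xA1: "alef (seat of hamza)",
            0xA2: "alef (prosthetic)",
            0xA3: "alef (long vowel)",
            0xB0: "lam",
        }

        def save(codes, path):
            # latin-1 maps bytes 0-255 one-to-one, so writing with it
            # preserves the raw experimental code values exactly.
            with open(path, "w", encoding="latin-1") as f:
                f.write("".join(chr(c) for c in codes))

        def load(path):
            with open(path, "r", encoding="latin-1") as f:
                return [ord(ch) for ch in f.read()]

        word = [0xA2, 0xB0, 0xA3]            # a hypothetical encoded word
        save(word, "sample.txt")
        assert load("sample.txt") == word    # round trip via an ordinary file
        for c in load("sample.txt"):
            print(hex(c), EXPERIMENTAL.get(c, "?"))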


