Re: Subject: Re: 32'nd bit & UTF-8

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Jan 20 2005 - 13:09:18 CST

    Hans Aberg wrote:

    >> Something like 99% of text data uses only BMP characters for which
    >> UTF-16 is pretty efficient.
    >
    > One can achieve better efficiency, if needed, using data compression
    > methods. So there is no reason to use UTF-16 for such reasons.

    OTOH, a variable-length encoding such as UTF-8 is a nightmare when it comes to
    efficiency for a number of applications. So what? The answer depends on the
    needs: different needs, different answers.
    UTF-16 is a compromise, as is UTF-8, as is UCS in a lot of ways.
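
    To illustrate the kind of cost I mean, here is a minimal Python sketch of
    locating the n-th character. The helper names are mine, invented for this
    example, and the UTF-16 shortcut assumes BMP-only text (no surrogate
    pairs); it is an illustration, not a benchmark.

```python
def utf8_offset_of(data: bytes, index: int) -> int:
    """Byte offset of the index-th code point in UTF-8 data (linear scan)."""
    count = 0
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            if count == index:
                return pos
            count += 1
    raise IndexError(index)


def utf16_bmp_offset_of(index: int) -> int:
    """Byte offset of the index-th code point, ASSUMING BMP-only UTF-16 text."""
    return 2 * index


text = "a\u00e9\u20acz"  # 'a', e-acute, euro sign, 'z': 1, 2, 3 and 1 bytes in UTF-8
data8 = text.encode("utf-8")
data16 = text.encode("utf-16-le")

print(utf8_offset_of(data8, 2))   # found by scanning every preceding byte
print(utf16_bmp_offset_of(2))     # found by plain arithmetic
```

    The UTF-8 lookup has to walk the bytes; the UTF-16 one is a multiplication,
    but only as long as nothing outside the BMP shows up.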

    > According to my memory, [...].
    > Linux also uses UTF-8.

    Well, among other character sets. The main distributions are heading this way,
    yes, probably. But are all Linux boxes running with UTF-8? Certainly not. Heaven
    forbid.

    > So in that domain, I think there is little use of UTF-16.

    So what?

    > The main problem is that in some domains, UTF-16 is already at use.
    > So there, one would need time to change.

    This assumes that UTF-16 is 'wrong', doesn't it? And furthermore, that UNIX
    (whatever you are hiding behind this word) is 'right'.

    > In the case of the C++
    > standard, one knows it takes at least a few years for a new
    > versions to come forth. I do not remember the exact wording for a
    > feature that is still in the standard, but to be phased out in a
    > later version.

    'Deprecated'. For example, ANSI C (1989) deprecated the use of K&R-style
    functions. But they are still part of the current standard, so they will be
    *required* to be supported by all conforming compilers until at least 2009.
    In other words, do not hold your breath.

    The life cycle of ISO standards has little in common with high-tech evolution.
    UTF-16 (and UTF-8, and the BOM) are part of an ISO standard. So... do not hold
    your breath!

    > My guess is that UTF-8 will be widespread file and
    > external stream format, because it is more compact,

    Ha ha ha!
    I happen to work with Indic content. Taking the natural/legacy encoding
    (ISCII) as 1, UTF-16 is roughly 2, and UTF-8 is more than 2.7. That is, it
    becomes bigger than the (Unicode) Latin transcriptions that use a lot of
    accents!
    East Asian users have similar concerns.
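
    Anyone can check the order of magnitude with a few lines of Python. This is
    a rough sketch under stated assumptions: as far as I know Python ships no
    ISCII codec, so the code point count stands in for the one-byte-per-
    character legacy size, and the sample phrase and the size_ratios helper
    are mine, chosen only for illustration.

```python
def size_ratios(text: str) -> tuple[float, float]:
    """Return (UTF-16 size, UTF-8 size) relative to one byte per code point.

    ASSUMPTION: the code point count approximates the size in a one-byte
    legacy encoding such as ISCII; no real ISCII codec is used here.
    """
    legacy_size = len(text)
    utf16_size = len(text.encode("utf-16-le"))  # no BOM counted
    utf8_size = len(text.encode("utf-8"))
    return utf16_size / legacy_size, utf8_size / legacy_size


# A short Devanagari phrase written with escapes: two six-letter words and a space.
sample = (
    "\u0928\u092e\u0938\u094d\u0924\u0947"
    " "
    "\u0926\u0941\u0928\u093f\u092f\u093e"
)

utf16_ratio, utf8_ratio = size_ratios(sample)
print(f"UTF-16: {utf16_ratio:.2f}x, UTF-8: {utf8_ratio:.2f}x")
```

    Every Devanagari letter costs three bytes in UTF-8 and two in UTF-16; only
    the spaces and ASCII punctuation pull the UTF-8 figure below 3.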

    Granted, UTF-8 is more compact in Europe (where I am perfectly happy with
    Latin-9, BTW), or for applications such as... GCC, a compiler.

    So what?

    Antoine


