Re: 32'nd bit & UTF-8

From: Hans Aberg
Date: Thu Jan 20 2005 - 18:52:37 CST

    On 2005/01/20 20:09, Antoine Leca wrote:

    >>> Something like 99% of text data uses only BMP characters for which
    >>> UTF-16 is pretty efficient.
    >> One can achieve better efficiency, if needed, using data compression
    >> methods. So there is no reason to use UTF-16 for such reasons.
    > OTOH, a variable-length encoding such as UTF-8 is a nightmare when it comes to
    > efficiency for a number of applications. So what? The answer looks
    > different depending on the needs: different needs, different answers.

    This is something I pointed out. Internally in programs, fixed width UTF-32
    may work best.
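
    A minimal sketch of why fixed width helps inside a program: in UTF-32
    the n-th code point sits at byte offset 4*n, while in UTF-8 one has to
    walk from the start. The helper names nth_utf32 and nth_utf8 are made up
    for this illustration and assume well-formed input.

```python
def nth_utf32(buf: bytes, n: int) -> str:
    """Fixed width: the n-th code point starts at byte offset 4*n."""
    return buf[4 * n: 4 * n + 4].decode("utf-32-le")


def nth_utf8(buf: bytes, n: int) -> str:
    """Variable width: skip n characters by scanning for lead bytes."""
    i = 0
    for _ in range(n):
        i += 1
        # Continuation bytes have the bit pattern 10xxxxxx.
        while i < len(buf) and (buf[i] & 0xC0) == 0x80:
            i += 1
    j = i + 1
    while j < len(buf) and (buf[j] & 0xC0) == 0x80:
        j += 1
    return buf[i:j].decode("utf-8")


# Code points that need 1, 2, 3 and 4 bytes in UTF-8.
text = "a\u00e9\u0915\U0001D11E"
as_utf32 = text.encode("utf-32-le")
as_utf8 = text.encode("utf-8")
```

    The UTF-32 lookup is a constant-time slice; the UTF-8 lookup costs time
    proportional to the position asked for.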

    > UTF-16 is a compromise, as is UTF-8, as is UCS in a lot of ways.

    UTF-16 is a quick fix, needed once Unicode broke the 2^16 limit.

    >> The main problem is that in some domains, UTF-16 is already in use.
    >> So there, one would need time to change.
    > This assumes that UTF-16 is 'wrong', isn't it? And furthermore, that UNIX
    > (whatever you are hiding behind this word) is 'right'.

    That seems to be the case, judging by the tendency in what people actually
    use. Also, UTF-16 cannot be extended beyond the current Unicode limit.
    The other encodings, UTF-8/32, can.
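
    The limit follows from the surrogate arithmetic: 1024 high surrogates
    times 1024 low surrogates give 2^20 code points above the BMP, so the
    largest value a pair can express is U+10FFFF. By contrast, the original
    UTF-8 design allowed sequences of up to six bytes covering 31 bits, and
    UTF-32 has 32 bits to work with. A small sketch, with function names
    chosen only for this illustration:

```python
HIGH_START = 0xD800  # first high (lead) surrogate
LOW_START = 0xDC00   # first low (trail) surrogate


def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    v = cp - 0x10000                      # 20-bit value
    return HIGH_START + (v >> 10), LOW_START + (v & 0x3FF)


def from_surrogates(hi: int, lo: int) -> int:
    """Recombine a surrogate pair into a code point."""
    return 0x10000 + ((hi - HIGH_START) << 10) + (lo - LOW_START)


# The last high and last low surrogate give the largest encodable value.
max_cp = from_surrogates(0xDBFF, 0xDFFF)
print(hex(max_cp))  # 0x10ffff
```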

    >> In the case of the C++
    >> standard, one knows it takes at least a few years for a new
    >> version to come forth. I do not remember the exact term for a
    >> feature that is still in the standard, but is to be phased out in a
    >> later version.
    > 'Deprecated'. For example, ANSI C (1989) deprecated the use of K&R-style
    > functions. But they are still part of the current standard, so they will be
    > *required* to be supported by all conforming compilers till at least 2009.
    > In other words, do not hold your breath.

    OK. Thank you. So one might declare UTF-16 deprecated, while guaranteeing
    support for it for a certain number of years.

    > The life cycle of ISO standards has little in common with high-tech evolution.
    > UTF-16 (and UTF-8, and BOM) are part of an ISO standard. So... do not hold
    > your breath!

    This is true, although there are now attempts to speed up the updating of
    ISO standards. Even so, a deprecated UTF-16 would need time to phase out.

    >> My guess is that UTF-8 will be a widespread file and
    >> external stream format, because it is more compact,
    > AH AH AH!
    > I happen to work with Indic content. Taking the natural/legacy encoding
    > (ISCII) as 1, UTF-16 is roughly 2, and UTF-8 is more than 2.7.

    I meant more compact relative to UTF-32.
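
    The ratios can be checked on a small sample. Python has no ISCII codec,
    so the sketch below simply assumes one byte per character for the legacy
    encoding; that assumption, the sample word and the function name
    encoded_sizes are all illustrative.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in several encodings (no BOM)."""
    return {
        # Assumption: an 8-bit legacy encoding such as ISCII uses
        # one byte per character of this text.
        "8-bit (assumed)": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# A six-code-point Devanagari word, written with escapes.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947"
sizes = encoded_sizes(sample)
print(sizes)
# {'8-bit (assumed)': 6, 'utf-8': 18, 'utf-16': 12, 'utf-32': 24}
```

    Pure Devanagari comes out at 3 bytes per character in UTF-8; mixing in
    ASCII spaces and punctuation pulls the average down, which fits a figure
    of "more than 2.7". UTF-8 is still smaller than UTF-32 here, but larger
    than UTF-16.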

    > That is, it
    > becomes bigger than the (Unicode) Latin transcriptions using a lot of
    > accents!
    > East Asian users have similar concerns.

    Yes. But since memory in computers doubles every 18 months or faster, this
    should not be much of a problem. And if space is tight, one can combine the
    encoding with compression algorithms, which will probably give much better
    results.
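
    A rough way to try this is to run a general-purpose compressor over the
    same text in each encoding. The sketch uses zlib from the Python standard
    library; the sample is an artificially repetitive Devanagari string, so
    it exaggerates the gains, and real text will behave differently.

```python
import zlib


def raw_and_compressed(text: str) -> dict:
    """Map each encoding to (raw byte count, zlib-compressed byte count)."""
    result = {}
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(enc)
        result[enc] = (len(raw), len(zlib.compress(raw, 9)))
    return result


# Assumption: a repeated word stands in for real Indic text.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947 " * 200

for enc, (raw, packed) in raw_and_compressed(sample).items():
    print(f"{enc:10s} raw={raw:6d} compressed={packed:6d}")
```

    On input like this the compressed sizes are far smaller than any of the
    raw encodings, so the choice of encoding matters much less once
    compression is applied.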

    > Granted, this is more compact in Europe (where I am perfectly happy with
    > Latin-9, BTW), or for applications such as... GCC, a compiler.
    > So what?

    So deprecate UTF-16!? :-)

      Hans Aberg
