Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 12:16:30 CST

    On 2005/01/20 14:00, Christopher Fynn at cfynn@gmx.net wrote:

    > Something like 99% of text data uses only BMP characters for which UTF-16
    > is pretty efficient.

    One can achieve better efficiency, if needed, by using data compression
    methods. So efficiency alone is no reason to use UTF-16.
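    As a rough illustration of the compression point, the following sketch
    compares raw and compressed sizes of the same text in the three encoding
    forms. The sample text, the helper name sizes, and the choice of zlib as
    the compressor are assumptions made for the example, not anything taken
    from this thread.

```python
import zlib

# Hypothetical sample: BMP-only text mixing ASCII and CJK characters.
sample = ("Unicode text sample. " + "日本語のテキスト。") * 200


def sizes(text):
    """Return {encoding: (raw_bytes, zlib_compressed_bytes)} for the text."""
    result = {}
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(encoding)
        result[encoding] = (len(raw), len(zlib.compress(raw)))
    return result


if __name__ == "__main__":
    for encoding, (raw_len, packed_len) in sizes(sample).items():
        print(f"{encoding:10} raw={raw_len:6} compressed={packed_len:6}")
```

    The raw sizes differ by encoding form, but after compression the gap
    tends to narrow considerably, since the redundancy a wider encoding adds
    is exactly what a compressor removes. Actual numbers depend on the text.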

    > Didn't MS natively support Unicode (/UCS-2) with the first version of
    > Windows NT - before UTF-8 came along - and chose a 16-bit form because
    > that was what Unicode was at the time NT was developed?

    I think that was the reason MS did it. Also, 16-bit encodings are said to
    be used for Asian languages for the same reason.

    > Doesn't Mac OS X use UTF-16 for most of its native APIs - except for stuff
    > that calls BSD system routines?

    Mac OS X is built on top of BSD UNIX. As far as I remember, it uses UTF-8
    in filenames and the like. Linux also uses UTF-8. GCC uses a 32-bit
    wchar_t on these platforms, and C is the language UNIX is built in. Mac OS X
    officially uses GCC. So in that domain, I think there is little use of
    UTF-16.
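    The width of wchar_t on a given system is easy to check. A minimal
    sketch, using Python's standard ctypes module instead of C so that it runs
    without a compiler; the helper name wchar_bits is hypothetical.

```python
import ctypes


def wchar_bits():
    """Return the width in bits of the platform's C wchar_t type."""
    return ctypes.sizeof(ctypes.c_wchar) * 8


if __name__ == "__main__":
    print(f"wchar_t is {wchar_bits()} bits wide on this platform")
```

    This typically reports 32 on Linux and Mac OS X and 16 on Windows, so
    the width is a property of the platform ABI, not of C itself.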

    The main problem is that in some domains, UTF-16 is already in use. So
    there, one would need time to change. In the case of the C++ standard, one
    knows it takes at least a few years for a new version to come out. I do
    not remember the exact term for a feature that is still in the standard
    but is to be phased out in a later version (presumably "deprecated").

    In the case of Unicode, it is fairly easy to write converters from UTF-16 to
    UTF-8 or UTF-32. So it appears that no major inconvenience would be caused,
    given enough time for the transition. My guess is that UTF-8 will be a
    widespread file and external stream format, because it is more compact for
    ASCII-heavy text and (without a BOM requirement) compatible with ASCII and
    with software written for 8-bit extended-ASCII byte streams. But
    internally, in programs that require speed, UTF-32 is the one to choose.
    There, UTF-16 does not offer any clear-cut advantage, unless one is
    positively sure to stay within the 16-bit base range (the BMP) most of the
    time. But Unicode has some very important extensions outside the 2^16
    range, for example many professional math symbols. So that range will
    probably be more important in the future than it has been up till now.
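    To show how small such a converter can be, here is a sketch that decodes
    UTF-16 code units into code points (that is, UTF-32 values) and then
    encodes those as UTF-8. The function names are hypothetical, and Python's
    built-in codecs already do this; the point is only to spell out the
    surrogate-pair arithmetic needed for characters outside the 2^16 range.

```python
def utf16_units_to_code_points(units):
    """Decode 16-bit code units into Unicode code points (UTF-32 values)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            # Any other unit is a BMP code point on its own.
            code_points.append(unit)
            i += 1
    return code_points


def code_points_to_utf8(code_points):
    """Encode code points as UTF-8 bytes (1 to 4 bytes per code point)."""
    out = bytearray()
    for cp in code_points:
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


if __name__ == "__main__":
    # "A" followed by U+1D400 MATHEMATICAL BOLD CAPITAL A (outside the BMP).
    units = [0x0041, 0xD835, 0xDC00]
    cps = utf16_units_to_code_points(units)
    print([hex(cp) for cp in cps])         # ['0x41', '0x1d400']
    print(code_points_to_utf8(cps).hex())  # 41f09d9080
```

    The math symbol needs two 16-bit units in UTF-16 and four bytes in UTF-8,
    but is a single element in the UTF-32 list, which is the fixed-width
    property that makes UTF-32 attractive for internal processing.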

      Hans Aberg


