Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 12:16:30 CST

Next message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: UTF-8 'BOM'"
In reply to: Christopher Fynn: "Re: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Antoine Leca: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Antoine Leca: "Re: Subject: Re: 32'nd bit & UTF-8"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

On 2005/01/20 14:00, Christopher Fynn at cfynn@gmx.net wrote:

> Something like 99% of text data uses only BMP characters for which UTF-16
> is pretty efficient.

If needed, one can achieve better efficiency with data compression
methods. So storage efficiency alone is no reason to use UTF-16.
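As a rough illustration of that point, here is a minimal Python sketch that
compares raw and zlib-compressed sizes of the same BMP-only text in UTF-8 and
UTF-16. The function name encoded_sizes and the sample string are made up for
this example, and a heavily repeated sample flatters the compressor, so real
text will show smaller gains.

```python
import zlib

def encoded_sizes(text):
    """Return {encoding: (raw_bytes, zlib_compressed_bytes)} for the text."""
    sizes = {}
    # utf-16-le is used so that no BOM is counted in the byte totals.
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        sizes[name] = (len(raw), len(zlib.compress(raw, 9)))
    return sizes

# Hypothetical sample: BMP-only Japanese text, where UTF-16 is the more
# compact of the two encodings before compression.
sample = "日本語のテキスト。" * 200

for name, (raw, packed) in encoded_sizes(sample).items():
    print(name, "raw:", raw, "compressed:", packed)
```

Once both are compressed, the difference between the two encodings matters
much less than the difference between compressed and uncompressed data.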

> Didn't MS natively support Unicode (/UCS-2) with the first version of
> Windows NT - before UTF-8 came along - and choose a 16-bit form because
> that was what Unicode was at the time NT was developed?

I think that was the reason MS did it. Also, 16 bits are said to be used
for Asian languages for the same reason.

> Doesn't Mac OS X use UTF-16 for most of its native APIs - except for stuff
> that calls BSD system routines?

Mac OS X is built on top of BSD UNIX. As far as I remember, it uses UTF-8
in filenames and the like. Linux also uses UTF-8. GCC uses a 32-bit
wchar_t, and C is the language UNIX is built in. Mac OS X officially uses
GCC. So in that domain, I think there is little use of UTF-16.

The main problem is that in some domains, UTF-16 is already in use. So
there, one would need time to change. In the case of the C++ standard, one
knows it takes at least a few years for a new version to come out. I do
not remember the exact term for a feature that is still in the standard
but is to be phased out in a later version.

In the case of Unicode, it is fairly easy to make converters from UTF-16 to
UTF-8 or UTF-32. So it appears that no major inconvenience would be caused,
given enough time for the transition. My guess is that UTF-8 will be a
widespread file and external stream format, because it is more compact,
and (without a BOM requirement) compatible with 7-bit ASCII. But
internally, in programs that require speed, UTF-32 is the one to choose.
There, UTF-16 does not offer any clear-cut advantage, unless one is
positively sure to stay within the 16-bit base most of the time. But Unicode
has some very important extensions outside the 2^16 range, for example many
professional math symbols. So that range will probably be more important in
the future than it has been up till now.
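
To show how small such a converter is, here is a minimal Python sketch that
decodes UTF-16 code units into code points (which is what UTF-32 stores) and
then encodes them as UTF-8. The function names are made up for this example,
and the input is assumed to be a list of 16-bit integers already in native
byte order. A real converter would also deal with byte order marks and might
prefer replacement characters to exceptions.

```python
def utf16_to_code_points(units):
    """Decode 16-bit code units into a list of code points (UTF-32 values)."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at index %d" % i)
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            # Plain BMP character: the code unit is the code point.
            out.append(u)
            i += 1
    return out

def code_points_to_utf8(code_points):
    """Encode a list of code points as UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %#x" % cp)
        if cp < 0x80:
            out.append(cp)  # ASCII passes through as a single byte
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)

# 'A' followed by U+1D400 (MATHEMATICAL BOLD CAPITAL A), which lies
# outside the 2^16 range and so needs a surrogate pair in UTF-16.
units = [0x0041, 0xD835, 0xDC00]
code_points = utf16_to_code_points(units)
print([hex(cp) for cp in code_points])
print(code_points_to_utf8(code_points).hex())
```

The math symbol takes two code units in UTF-16 and four bytes in UTF-8, but
only one element in the code point list, which is why fixed-width UTF-32 is
simpler to index internally.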

Hans Aberg

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 12:18:08 CST