Re: Subject: Re: 32'nd bit & UTF-8

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Jan 20 2005 - 13:09:18 CST

Next message: Addison Phillips [wM]: "RE: UTF-8 'BOM'"

Previous message: Andrew C. West: "Re: UTF-8 'BOM'"
In reply to: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"

Hans Aberg wrote:

>> Something like 99% of text data uses only BMP characters for which
>> UTF-16 is pretty efficient.
>
> One can achieve better efficiency, if needed, using data compression
> methods. So there is no reason to use UTF-16 for such reasons.

OTOH, a variable-length encoding such as UTF-8 is a nightmare when it comes to
efficiency for a number of applications. So what? The answer depends on the
needs: different needs, different answers.
UTF-16 is a compromise, as is UTF-8, as is UCS in a lot of ways.
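
One place where variable-length encoding costs time is random access by code point. The following is only a minimal sketch of that cost, not a benchmark; the function names `utf8_offset` and `utf32_offset` are hypothetical and chosen for illustration. Note that UTF-16 is itself variable-length outside the BMP, so the simple multiplication only applies to a truly fixed-width form such as UTF-32 (or to UTF-16 text known to be BMP-only).

```python
def utf8_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the index-th code point (0-based) in UTF-8 data.

    A linear scan is needed because each code point occupies 1 to 4 bytes.
    """
    seen = -1
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other byte
        # starts a new code point.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == index:
                return offset
    raise IndexError("code point index out of range")


def utf32_offset(index: int) -> int:
    """Fixed-width encoding: the byte offset is a plain multiplication."""
    return 4 * index


# Code points that take 1, 2, 3, 4 and 1 bytes in UTF-8 (illustrative sample).
text = "a\u00e9\u20ac\U0001d11e!"
data = text.encode("utf-8")
offsets = [utf8_offset(data, i) for i in range(len(text))]
print(offsets)
```

Applications that mostly stream text from start to end never pay this cost, which is one reason the answer depends on the needs.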

> According to my memory, [...].
> Linux also uses UTF-8.

Well, among other character sets. The main distributions are heading this way,
yes, probably. But all Linux boxes running with UTF-8? Certainly not. Heaven
forbid.

> So in that domain, I think there is little use of UTF-16.

So what?

> The main problem is that in some domains, UTF-16 is already in use.
> So there, one would need time to change.

This assumes that UTF-16 is 'wrong', doesn't it? And furthermore, that UNIX
(whatever you are hiding behind this word) is 'right'.

> In the case of the C++
> standard, one knows it takes at least a few years for a new
> version to come forth. I do not remember the exact wording for a
> feature that is still in the standard, but to be phased out in a
> later version.

'Deprecated'. For example, ANSI C (1989) deprecated the use of K&R-style
functions. But they are still part of the current standard, so all conforming
compilers will be *required* to support them until at least 2009.
In other words, do not hold your breath.

The life cycle of ISO standards has little in common with high-tech evolution.
UTF-16 (and UTF-8, and BOM) are part of an ISO standard. So... do not hold
your breath!

> My guess is that UTF-8 will be widespread file and
> external stream format, because it is more compact,

HA HA HA!
I happen to work with Indic content. Taking the natural/legacy encoding
(ISCII) as 1, UTF-16 is roughly 2, and UTF-8 is more than 2.7. That is, it
becomes bigger than the (Unicode) Latin transcriptions using a lot of
accents!
East Asian users have similar concerns.
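
The ratios quoted above can be checked roughly with a few lines of Python. This is a sketch under stated assumptions: the sample phrase is an arbitrary illustration, and because no ISCII codec is used here, the baseline simply assumes one byte per character, which only approximates what a legacy 8-bit Indic encoding would need.

```python
# A short Devanagari phrase, written with escapes so the code point count
# is unambiguous (12 Devanagari code points plus one ASCII space).
sample = (
    "\u0928\u092e\u0938\u094d\u0924\u0947"  # first word, 6 code points
    " "
    "\u0926\u0941\u0928\u093f\u092f\u093e"  # second word, 6 code points
)

# Assumption: a legacy 8-bit encoding spends about one byte per character.
baseline = len(sample)
utf16 = len(sample.encode("utf-16-le"))  # no BOM, 2 bytes per BMP code point
utf8 = len(sample.encode("utf-8"))       # 3 bytes per Devanagari code point

print(f"baseline: {baseline} bytes")
print(f"UTF-16:   {utf16} bytes, ratio {utf16 / baseline:.2f}")
print(f"UTF-8:    {utf8} bytes, ratio {utf8 / baseline:.2f}")
```

The UTF-8 ratio stays below 3 only because spaces, digits and ASCII punctuation take a single byte; the more running text and the less markup, the closer it gets to 3.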

Granted, UTF-8 is more compact in Europe (where I am perfectly happy with
Latin-9, BTW), or for applications such as... GCC, a compiler.
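
For contrast, the same kind of measurement on Western European text shows why UTF-8 looks compact there. Again this is only a sketch: the French sentence is an arbitrary sample, and `iso8859_15` is the Python codec name for Latin-9.

```python
# A French sample using characters that exist in Latin-9, including
# the oe ligature (U+0153) and the euro sign (U+20AC).
sample = "Le c\u0153ur a ses raisons, \u00e7a co\u00fbte 5 \u20ac."

latin9 = len(sample.encode("iso8859_15"))  # one byte per character
utf8 = len(sample.encode("utf-8"))         # 1 byte for ASCII, 2 or 3 otherwise
utf16 = len(sample.encode("utf-16-le"))    # 2 bytes per character, no BOM

print(f"Latin-9: {latin9} bytes")
print(f"UTF-8:   {utf8} bytes")
print(f"UTF-16:  {utf16} bytes")
```

Here UTF-8 costs only a few bytes more than Latin-9, while UTF-16 doubles the size; for pure ASCII input such as most compiler source code, UTF-8 and Latin-9 are byte-for-byte the same length.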

So what?

Antoine

