From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 03 2003 - 01:39:03 EST
Frank Yung-Fong Tang <ytang0648 at aol dot com> wrote:
> Agree with you. Just want to make a point that the implementation is
> not "< 1%" of the work.
Oh, for heaven's sake:
If you are starting with a NON-UNICODE application -- one that has NO
prior knowledge of UTF-anything or UCS-anything -- and you are adding
"Unicode support" to it, the amount of work to support the entire
17-plane Unicode range compared to just the BMP is relatively small. If
I ever said "less than one percent," I apologize. Such a figure can
only be determined on a case-by-case basis.
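
To make the size of that extra step concrete, here is a minimal sketch of a UTF-8 encoder written from scratch. The function name `encode_utf8` is hypothetical, not taken from any code mentioned in this thread. Under these assumptions, covering the supplementary planes amounts to the final branch; the first three branches are already needed for the BMP.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper)."""
    # Reject values outside U+0000..U+10FFFF and the surrogate range,
    # which are not valid scalar values.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:            # 1 byte: ASCII
        return bytes([cp])
    if cp < 0x800:           # 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 3 bytes: rest of the BMP
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: planes 1-16. This is the only branch a BMP-only
    # encoder would be missing.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

The result can be checked against Python's built-in codec, for example by comparing `encode_utf8(0x10400)` with `chr(0x10400).encode("utf-8")`.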
>> I'll be happy to supply UTF-8 code that handles 4-byte sequences.
>> That is not the same thing as converting an entire system from
>> 16-bit to 32-bit integers, or adding proper UTF-16 surrogate support
>> to a UCS-2-only system. Of course that is more work.
>
> Your view is based on the assumption that the internal code is UCS4
> instead of UTF-16.
Didn't you read what I wrote?
> There is nothing wrong if people choose to use UTF-16 instead of UCS4
> in the API, even as of 2003. Do you agree?
Sure, no problem. Both UTF-16 and UCS-4 (= UTF-32) support the full
Unicode range. Only UCS-2 does not.
> If people do use UTF-16 in the API, it is natural for people who care
> about the BMP but not about Planes 1-16 to work only on the BMP,
> right? I am not saying they do the right thing. I am saying they do
> the "natural" thing. Remember, the text describing surrogates in
> the Unicode 4.0 standard is probably only 5-10 pages total in that
> 1462-page standard. For developers who are not going to implement the
> other 1000 pages right, it is natural for them to think "why do I
> need to get these 10 pages right?"
I don't care if they choose not to provide fonts or rendering support
for the supplementary planes. But it seems silly to deliberately
exclude them from the underlying architecture.
"Using UTF-16" implies that one supports the surrogate mechanism.
UTF-16 without surrogate support is UCS-2.
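
As an illustration of what the surrogate mechanism involves, the following is a minimal sketch of converting between code points and UTF-16 code units. The helper names `to_utf16_units` and `from_utf16_units` are hypothetical. A UCS-2-only system would effectively stop at the `cp < 0x10000` case.

```python
from typing import List


def to_utf16_units(cp: int) -> List[int]:
    """Return the UTF-16 code unit(s) for one Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                      # BMP: one unit, same as UCS-2
    v = cp - 0x10000                     # 20 bits left to distribute
    return [0xD800 | (v >> 10),          # high (leading) surrogate
            0xDC00 | (v & 0x3FF)]        # low (trailing) surrogate


def from_utf16_units(units: List[int]) -> List[int]:
    """Decode UTF-16 code units into code points.

    Unpaired surrogates are passed through unchanged; a stricter
    decoder might reject them instead (a design choice of this sketch).
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10)
                       + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out
```

For example, U+10400 becomes the pair D801 DC00, and decoding that pair gives U+10400 back.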
Of course the Unicode Standard doesn't spend a lot of time describing
the surrogate mechanism. It only applies to the UTF-16 character
encoding form. The description of characters encoded in the
supplementary planes, however, is much more extensive.
>> I can't fight this battle with people who would rather stay with
>> ASCII and 7/8 bits per character. They are not living in a Unicode
>> world.
>
> But how about the UTF-16 vs UCS4 battle?
Well, UTF-16 certainly does occupy less space than UTF-32 (henceforth I
will use this term instead of "UCS-4") in memory, on disk, wherever. By
all means, when *storing* large amounts of data, use an appropriately
compact form. That might mean UTF-16, UTF-8, or a compression format
such as SCSU or BOCU-1, or it might mean compressing the data using gzip
or bzip2.
When *processing* character data in memory, I would assume a fixed-width
encoding like UTF-32 would be more convenient than a variable-width
encoding like UTF-16. But if the extra complexity (such as it is) of
UTF-16 is not a problem, by all means go ahead and use it.
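
The convenience difference can be sketched with a lookup of the n-th code point in an encoded buffer. This is only an illustration: the function names are hypothetical, and both functions assume well-formed little-endian input without a byte order mark. With UTF-32 the offset is computed directly; with UTF-16 the buffer has to be scanned so that surrogate pairs are stepped over as a unit.

```python
import struct


def nth_code_point_utf32(data: bytes, n: int) -> int:
    """Fixed width: code point n sits at byte offset 4 * n."""
    return struct.unpack_from("<I", data, 4 * n)[0]


def nth_code_point_utf16(data: bytes, n: int) -> int:
    """Variable width: scan from the start, skipping surrogate pairs."""
    pos = 0
    for _ in range(n):
        (unit,) = struct.unpack_from("<H", data, pos)
        # A high surrogate means this character occupies two units.
        pos += 4 if 0xD800 <= unit <= 0xDBFF else 2
    (unit,) = struct.unpack_from("<H", data, pos)
    if 0xD800 <= unit <= 0xDBFF:
        (low,) = struct.unpack_from("<H", data, pos + 2)
        return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit


text = "a\U00010400b"
utf32 = text.encode("utf-32-le")
utf16 = text.encode("utf-16-le")
# Both lookups agree on every position in the sample text.
for i, ch in enumerate(text):
    assert nth_code_point_utf32(utf32, i) == ord(ch)
    assert nth_code_point_utf16(utf16, i) == ord(ch)
```

The UTF-16 version is not hard to write, which fits the "such as it is" above, but it is a linear scan rather than a constant-time index.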
>> I would truly be surprised if full 17-plane Unicode support in a
>> single app could be demonstrated to be a matter of "multiple millions
>> of dollars."
>
> It is not the full 17-plane Unicode support which will contribute to
> it. It is the
> (Number of ASCII-only records X sizeof(record in UCS4)) - (Number
> of ASCII-only records X sizeof(record in ASCII))
>
> that contributes to it,
>
> compared to
>
> (Number of ASCII-only records X sizeof(record in UTF-8)) - (Number
> of ASCII-only records X sizeof(record in ASCII))
> or
>
> (Number of ASCII-only records X sizeof(record in UTF-16)) - (Number
> of ASCII-only records X sizeof(record in ASCII))
>
> The other comparison is
> (Number of BMP-only records X sizeof(record in UCS4)) - (Number of
> BMP-only records X sizeof(record in UTF-8))
> or
>
> (Number of BMP-only records X sizeof(record in UCS4)) - (Number of
> BMP-only records X sizeof(record in UTF-16))
>
> Of course, sizeof() here really means the "average size of a record
> with those data".
I have never suggested that people with ASCII-only data should suddenly
quadruple their storage needs by storing it all in UTF-32. That's what
UTF-8 and SCSU are for. In fact, their data is already in UTF-8, isn't
it?
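
The storage arithmetic can be checked with a short sketch. The helper `encoded_sizes` and the sample record are hypothetical, chosen only for illustration; the encodings use Python's standard codecs without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of `text` in each encoding form."""
    codecs = (("UTF-8", "utf-8"),
              ("UTF-16", "utf-16-le"),
              ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


record = "Fullerton, California"      # an ASCII-only sample record
sizes = encoded_sizes(record)
ascii_size = len(record.encode("ascii"))

# ASCII-only data is byte-for-byte identical in UTF-8 ...
assert record.encode("ascii") == record.encode("utf-8")
# ... doubles in UTF-16 and quadruples in UTF-32.
assert sizes["UTF-16"] == 2 * ascii_size
assert sizes["UTF-32"] == 4 * ascii_size
```

For ASCII-only records the first term of the UTF-8 comparison above is therefore zero, which is the point of the closing question.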
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/