Re: UTF-16 inside UTF-8

From: YTang0648@aol.com
Date: Wed Nov 05 2003 - 19:58:34 EST

Next message: YTang0648@aol.com: "Re: UTF-16 inside UTF-8"

Previous message: Peter Kirk: "Re: Merging combining classes, was: New contribution N2676"
Maybe in reply to: Jill Ramonsky: "UTF-16 inside UTF-8"
Next in thread: Doug Ewell: "Re: UTF-16 inside UTF-8"
Reply: Doug Ewell: "Re: UTF-16 inside UTF-8"

In a message dated 11/5/2003 3:42:42 PM Pacific Standard Time,
dewell@adelphia.net writes:
Topic-change alert! I'm not talking about glyph support in fonts, or
bidi support, or collation, or contextual shaping, or any other aspect
of Unicode support. I'm talking about completely denying the existence
of non-BMP characters.

There are tons of applications -- Notepad is a basic example -- that
allow the entry of any arbitrary BMP character. They don't allow some
BMP characters and disallow others. That's all I'm talking about. Now,
if such an application allows BMP characters but disallows supplementary
characters, as MySQL (e.g.) does, I think that is an unnecessary
restriction.

Surrogates were defined in Unicode 2.0, which was published in 1996. Did
Notepad in Windows 98 support them, two years after Unicode 2.0 was published?
No. Microsoft did not even support surrogates in the Notepad that came with
Windows Me. In fact, you need to install a special package on Windows 2000 to
enable surrogate support. Why did it take that long? Very simple: because it
is not as simple as you think. If you calculate how long it took Microsoft to
add surrogate support to Windows from the time surrogates were defined in
Unicode 2.0, you can probably estimate how long it will take a piece of
software to add surrogate support if it is only now starting to add Unicode
support.

One of these days I'm going to implement a "Unicode" front end that
supports Basic Latin and U+A068 YI SYLLABLE BBOP, but *no other
characters*, just to show how silly such a restriction would be.
(Remember, it's conformant as long as I don't lie about it. That
doesn't mean it's not silly.)

There is a huge gap between "not silly" and "making it work". It is not that
simple to make a whole piece of software support surrogates correctly in every
aspect.

> For back-end software that does pure data processing without keyboard
> input or text rendering, it is easier to implement the whole Unicode
> BMP range, or even surrogates as well.

(1) "Surrogates" are only about UTF-16, not any other aspect of
Unicode.
(2) Supporting surrogates in UTF-16 is not tremendously difficult.

OK. Here are some examples to show how difficult it is to support surrogates:

Example 1:
I have this API, where UniChar is defined as a two-byte type holding 16 bits:

UniChar ToLower(UniChar aChar)

Tell me how this can support surrogates.
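
To make Example 1 concrete, here is a rough Python sketch. It models UniChar as
an int in the range 0..0xFFFF. The helper names (utf16_units, to_lower_unit,
to_lower_units) are made up for illustration and do not come from any real
library. A function that sees one 16-bit unit at a time never sees both halves
of a surrogate pair, so it can only pass them through unchanged.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def to_lower_unit(unit):
    """Hypothetical ToLower(UniChar): sees a single 16-bit unit."""
    if 0xD800 <= unit <= 0xDFFF:
        return unit  # half of a pair: the real character is unknown here
    lowered = chr(unit).lower()
    # Keep the unit unchanged if the mapping does not fit in one 16-bit unit.
    if len(lowered) == 1 and ord(lowered) <= 0xFFFF:
        return ord(lowered)
    return unit


def to_lower_units(units):
    """Sequence-based variant: decodes surrogate pairs before case mapping."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    text = data.decode("utf-16-le", errors="surrogatepass")
    return utf16_units(text.lower())


# "A" followed by U+10400 DESERET CAPITAL LETTER LONG I (a surrogate pair).
sample = utf16_units("A\U00010400")
per_unit = [to_lower_unit(u) for u in sample]
per_sequence = to_lower_units(sample)
```

In this sketch per_unit lowercases only the "A", while per_sequence also maps
U+10400 to U+10428. The second behavior is only possible once the signature
changes from one unit to a sequence (or to a wider code point type), which is
exactly the kind of API change in question.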

Example 2:
I have this API:

int FindCharInString(String, UniChar)

Tell me what the return value should mean. Should it be the count of UniChar
units from the beginning of the String, or the count of CHARACTERs from the
beginning of the String? What should I do when I start to add surrogate
support?
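
The two possible meanings of the return value in Example 2 can be sketched
like this in Python. The function names are hypothetical, and the string is
modeled as a list of 16-bit code units.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def find_unit_index(units, target):
    """Index of the first matching code unit, or -1 if absent."""
    for i, unit in enumerate(units):
        if unit == target:
            return i
    return -1


def unit_index_to_char_index(units, unit_index):
    """Count the whole characters that precede a code unit index."""
    count = 0
    i = 0
    while i < unit_index:
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1  # a valid pair is one character
        count += 1
    return count


units = utf16_units("b\U00010000c")  # "b", one supplementary character, "c"
unit_index = find_unit_index(units, ord("c"))
char_index = unit_index_to_char_index(units, unit_index)
```

Here unit_index is 3 and char_index is 2. Both are defensible answers, but
existing callers were written against only one of them.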

Example 3:
I have this API:

int LengthOfString(String)

Should this API return the number of UniChar units or the number of CHARACTERs?
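
A small Python sketch of the two candidate lengths in Example 3. The function
names are invented for illustration. Python 3 strings count code points, so
the number of UTF-16 units has to be computed from the encoded form.

```python
def length_in_units(text):
    """Number of 16-bit UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le", errors="surrogatepass")) // 2


def length_in_characters(text):
    """Number of code points (Python's str is indexed by code point)."""
    return len(text)


text = "b\U00010000c"
unit_count = length_in_units(text)
char_count = length_in_characters(text)
```

For this string unit_count is 4 and char_count is 3. A caller that allocates a
buffer wants the first number; a caller that counts what the user typed
probably wants the second.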

Example 4:
I have this API:

String Left(String, int a)

What should a mean: an index in UniChar units or an index in CHARACTERs?
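
For Example 4, here is a sketch of what goes wrong if a stays a count of
16-bit units, plus one possible (not the only) way to avoid cutting a pair in
half. Both function names are hypothetical.

```python
def left_units(units, count):
    """Naive Left(): keeps the first count code units."""
    return units[:count]


def left_units_safe(units, count):
    """Like left_units, but backs off one unit rather than split a pair."""
    cut = max(0, min(count, len(units)))
    if 0 < cut < len(units):
        ends_with_high = 0xD800 <= units[cut - 1] <= 0xDBFF
        next_is_low = 0xDC00 <= units[cut] <= 0xDFFF
        if ends_with_high and next_is_low:
            cut -= 1  # the cut would fall inside a surrogate pair
    return units[:cut]


units = [0x62, 0xD800, 0xDC00, 0x63]  # "b", U+10000 as a pair, "c"
naive = left_units(units, 2)
safe = left_units_safe(units, 2)
```

The naive result ends in a lone high surrogate (0xD800), which is not
well-formed UTF-16. The safe variant returns a shorter string than the caller
asked for, which is a behavior change of its own.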

>> and implementing UTF-8 support for the entire Unicode code space is
>> about 0.1% harder than artificially crippling it by restricting it to
>> the BMP.
>
> I disagree with what you said, "about 0.1% harder".
>
> For many developers, adding 4-byte UTF-8 for surrogate support simply
> means opening a can of worms.

See point (1) above.

> After that, they need to worry about how to
> support surrogates, which is quite complex in terms of API design and change.

See points (1) and (2) above.

> The work to make the converter convert UTF-8 to a surrogate pair and
> back is probably, as you said, "0.1% harder". But the work AFTER they open
> that door is much harder to manage. As the famous saying goes, "Unicode is
> not the answer for internationalization; Unicode is the question for
> internationalization". Thanks for all the job opportunities the Unicode
> standard has created (and keeps creating) for us :)
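
For reference, the converter step mentioned in the quoted paragraph is roughly
the following arithmetic. This is a minimal Python sketch with made-up
function names and only basic validation, not a complete UTF-8 decoder.

```python
def decode_utf8_4byte(data):
    """Decode one 4-byte UTF-8 sequence to a code point (minimal checks)."""
    if len(data) != 4 or (data[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 lead byte")
    if any((b & 0xC0) != 0x80 for b in data[1:]):
        raise ValueError("bad continuation byte")
    return (((data[0] & 0x07) << 18) | ((data[1] & 0x3F) << 12)
            | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F))


def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = decode_utf8_4byte(b"\xf0\x90\x90\x80")  # U+10400
pair = to_surrogate_pair(code_point)
round_trip = from_surrogate_pair(*pair)
```

This part is small and mechanical. The disagreement in this thread is about
everything that has to change after the pair exists in the string.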

See point (1) above. Other than UTF-16 surrogates -- and remember, this
is not 1993; the world of Unicode no longer revolves around the 16-bit
encoding form -- what aspect of supplementary character support is so
much more complicated than BMP support?

1. Dependent technology: for example, your software depends on Tcl, but
Tcl 8.4.4 does not support surrogates.
2. Dependent protocols: for example, GSM 03.38 defines only the default
alphabet and UCS-2, but not UTF-8. What is the point of a GSM gateway accepting
surrogates or not? Why bother? They will not be shown on people's cell phones
anyway, because of the GSM protocol.
3. The definition of the API: for example, you have these String methods.

int indexOf(int ch)
Returns the index within this string of the first occurrence of the
specified character.

If the string a is "b" + a surrogate pair + "c" and I call a.indexOf("c"),
what should it return, 2 or 3? If the caller then calls a.charAt(2), what
should I return? The low surrogate? Or the "c"?

char charAt(int index)
Returns the character at the specified index.

How can I return the whole surrogate pair if someone calls a.charAt(1)? Or
should I just return the high surrogate?

String substring(int beginIndex)
Returns a new string that is a substring of this string.

What should we return if someone calls a.substring(2)? The low surrogate and
the "c"? The high surrogate plus the low surrogate plus the "c"? An error? What
happens if the software originally did not return an error code for substring
and there is no exception model to invoke?

4. Memory and Performance trade off.
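
To illustrate point 3, here is one possible convention, sketched in Python
over a list of 16-bit units: indexes stay in code units so old callers keep
working, and a separate accessor returns a whole code point. The names
code_point_at and has_unpaired_surrogate are hypothetical, and this is only
one of several designs a library could pick.

```python
def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def code_point_at(units, index):
    """Code point that starts at a code unit index."""
    unit = units[index]
    if is_high(unit) and index + 1 < len(units) and is_low(units[index + 1]):
        return 0x10000 + ((unit - 0xD800) << 10) + (units[index + 1] - 0xDC00)
    return unit  # a BMP unit, or an unpaired surrogate returned as-is


def has_unpaired_surrogate(units):
    """True if any surrogate is not part of a valid high+low pair."""
    i = 0
    while i < len(units):
        if is_high(units[i]):
            if i + 1 < len(units) and is_low(units[i + 1]):
                i += 2
                continue
            return True
        if is_low(units[i]):
            return True
        i += 1
    return False


a = [0x62, 0xD800, 0xDC00, 0x63]  # "b" + surrogate pair for U+10000 + "c"
index_of_c = a.index(0x63)        # code unit index
at_1 = code_point_at(a, 1)        # starts on the high surrogate
at_2 = code_point_at(a, 2)        # starts on the low surrogate
tail_from_2 = a[2:]               # what a unit-based substring(2) would give
```

Under this convention index_of_c is 3, at_1 is the full code point 0x10000,
at_2 is just the lone low surrogate 0xDC00, and tail_from_2 is an ill-formed
string that has_unpaired_surrogate flags. None of the answers is free: each
one either changes old behavior or lets broken strings through.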

You can probably get a sense of the difficulty if you look at how many
specification changes Microsoft needed to make to add surrogate support to the
OpenType font format. Those are just specification changes, not including code
changes or API changes.

'cmap' http://www.microsoft.com/typography/otspec/cmap.htm

Look at:
Format 4: Segment mapping to delta values
Supporting 4-byte character codes

It is easy to add surrogate support to your application if your application
does nothing. It is difficult (but not impossible) to add surrogate support if
your application does some data processing. It is hard to add surrogate support
if your software is a library with a previously defined API.

I am not saying software should not support surrogates. I am saying don't
underestimate the effort. And when a piece of software does support surrogates
correctly, give its developers praise instead of taking it for granted. It is
hard work.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

==================================
Frank Yung-Fong Tang
System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 "For God so loved the world that he gave his one and only Son, that
whoever believes in him shall not perish but have eternal life.

Does your software display Thai language text correctly for Thailand users?
-> Basic Concept of Thai Language linked from Frank Tang's
Iñtërnâtiônàlizætiøn Secrets
Want to translate your English text to something Thailand users can
understand ?
-> Try English-to-Thai machine translation at
http://c3po.links.nectec.or.th/parsit/


This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 20:40:36 EST