Re: UTF-16 inside UTF-8

Date: Wed Nov 05 2003 - 19:58:34 EST

    In a message dated 11/5/2003 3:42:42 PM Pacific Standard Time, writes:
    Topic-change alert! I'm not talking about glyph support in fonts, or
    bidi support, or collation, or contextual shaping, or any other aspect
    of Unicode support. I'm talking about completely denying the existence
    of non-BMP characters.

    There are tons of applications -- Notepad is a basic example -- that
    allow the entry of any arbitrary BMP character. They don't allow some
    BMP characters and disallow others. That's all I'm talking about. Now,
    if such an application allows BMP characters but disallows supplementary
    characters, as MySQL (e.g.) does, I think that is an unnecessary
    Surrogates were defined in Unicode 2.0, which was published in 1996. Did
    Notepad in Windows 98 support them, two years after Unicode 2.0 was
    published? No. Microsoft did not even support surrogates in the Notepad that
    came with Windows Me. In fact, you need to install a special package on
    Windows 2000 to enable surrogate support. Why did it take that long? Very
    simple: because it is not as simple as you think. If you calculate how long
    it took Microsoft to add surrogate support to Windows, counting from the time
    surrogates were defined in Unicode 2.0, you can probably estimate how long it
    will take a piece of software to add surrogate support if it is only now
    starting to add Unicode support.
    One of these days I'm going to implement a "Unicode" front end that
    supports Basic Latin and U+A068 YI SYLLABLE BBOP, but *no other
    characters*, just to show how silly such a restriction would be.
    (Remember, it's conformant as long as I don't lie about it. That
    doesn't mean it's not silly.)
    There is a huge gap between "not silly" and "make it work". It is not that
    simple to make a whole piece of software support surrogates correctly in
    every aspect.
    > For back-end software that does pure data processing without keyboard
    > input or text rendering, it is easier to implement the whole Unicode
    > BMP range, or even surrogates as well.

    (1) "Surrogates" are only about UTF-16, not any other aspect of
    (2) Supporting surrogates in UTF-16 is not tremendously difficult.
    Here are some examples to show you how difficult it is to support surrogates:

    Example 1: I have this API, where UniChar is defined as a two-byte type
    holding 16 bits:

    UniChar ToLower(UniChar aChar)

    Tell me how this can support surrogates.
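
    The problem in Example 1 can be sketched in Python. This is only an
    illustration under stated assumptions: to_lower_unit is a hypothetical
    stand-in for a 16-bit ToLower, and utf16_units / units_to_text are helper
    names invented here to expose the UTF-16 code units of a string. A function
    that sees one 16-bit unit at a time has no way to lowercase a supplementary
    character such as U+10400 DESERET CAPITAL LETTER LONG I, because each call
    receives only half of the surrogate pair.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def units_to_text(units):
    """Rebuild a Python string from a list of UTF-16 code units."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def to_lower_unit(unit):
    """Hypothetical 16-bit ToLower: it sees one code unit at a time."""
    if 0xD800 <= unit <= 0xDFFF:
        # Half of a surrogate pair: the full character is unknown here.
        return unit
    lowered = chr(unit).lower()
    # Leave the unit alone if the mapping does not fit in one 16-bit unit.
    if len(lowered) != 1 or ord(lowered) > 0xFFFF:
        return unit
    return ord(lowered)

# LATIN CAPITAL LETTER A followed by DESERET CAPITAL LETTER LONG I (U+10400)
text = "A\U00010400"
per_unit = units_to_text([to_lower_unit(u) for u in utf16_units(text)])
per_character = text.lower()  # Python lowercases whole code points
```

    Here per_unit lowercases only the "A", while per_character also maps
    U+10400 to U+10428. Supporting the second behavior would need a signature
    that takes a whole code point (or a string), which is an API change.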

    Example 2: I have this API:

    int FindCharInString( String, UniChar)

    Tell me what the return value should mean. Should it be the count of
    UniChar units from the beginning of the String, or the count of CHARACTERs
    from the beginning of the String? What should I do when I start to add
    surrogate support?

    Example 3: I have this API:

    int LengthOfString(String)

    Should this API return the number of UniChar units or the number of
    CHARACTERs?

    Example 4: I have this API:

    String Left(String, int a)

    What should a mean: an index in UniChar units or an index in CHARACTERs?
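
    The ambiguity in Examples 2 to 4 can be made concrete with a small Python
    sketch. All function names below are hypothetical and exist only for this
    illustration. Python's own str type counts code points, so UTF-16 code
    units are obtained by encoding the string first.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def length_in_units(text):
    """LengthOfString counted in 16-bit code units."""
    return len(utf16_units(text))

def length_in_characters(text):
    """LengthOfString counted in characters (code points)."""
    return len(text)

def find_char(text, ch):
    """Return (code_unit_index, character_index) of ch, or (-1, -1)."""
    char_index = text.find(ch)
    if char_index < 0:
        return (-1, -1)
    # Every character before the match contributes one or two code units.
    return (length_in_units(text[:char_index]), char_index)

def left_units(text, count):
    """Left() counted in code units; the result may end with half a pair."""
    return utf16_units(text)[:count]

def left_characters(text, count):
    """Left() counted in characters; never splits a surrogate pair."""
    return text[:count]

sample = "b\U00010400c"  # "b" + one supplementary character + "c"
```

    For this sample the two interpretations disagree everywhere: the length is
    4 units but 3 characters, "c" is found at unit index 3 but character index
    2, and left_units(sample, 2) ends with a lone high surrogate while
    left_characters(sample, 2) keeps the pair whole.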

    >> and implementing UTF-8 support for the entire Unicode code space is
    >> about 0.1% harder than artificially crippling it by restricting it to
    >> the BMP.
    > I disagree with what you said about "about 0.1% harder".
    > For many developers, adding 4-byte UTF-8 for surrogate support simply
    > means opening a can of worms.

    See point (1) above.

    > After that, they need to worry about how to
    > support surrogates, which is quite complex in terms of API design and
    > change.

    See points (1) and (2) above.

    > The work to make the converter convert UTF-8 to a surrogate pair and
    > back is probably, as you said, "0.1% harder". But the work AFTER they open
    > that door is much harder to manage. As the famous saying goes, "Unicode is
    > not the answer for internationalization; Unicode is the question for
    > internationalization". Thanks for all the job opportunities the Unicode
    > standard has created (and keeps creating) for us :)
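
    For reference, the converter step that both sides agree is small looks
    roughly like the Python sketch below. The function names are hypothetical,
    and decode_utf8_4byte assumes its input is already a well-formed 4-byte
    sequence; a real converter would also have to reject overlong forms and
    values above U+10FFFF.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))

def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def decode_utf8_4byte(data):
    """Decode one well-formed 4-byte UTF-8 sequence into a code point."""
    b0, b1, b2, b3 = data
    return (((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6) | (b3 & 0x3F))

# U+10400 round trip: UTF-8 bytes -> code point -> surrogate pair -> code point
code_point = decode_utf8_4byte("\U00010400".encode("utf-8"))
pair = to_surrogate_pair(code_point)
```

    The arithmetic is a few shifts and masks. The disagreement in this thread
    is about everything downstream of it.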

    See point (1) above. Other than UTF-16 surrogates -- and remember, this
    is not 1993; the world of Unicode no longer revolves around the 16-bit
    encoding form -- what aspect of supplementary character support is so
    much more complicated than BMP support?
    1. Dependent technology: for example, your software depends on Tcl, but
    Tcl 8.4.4 does not support surrogates.
    2. Dependent protocols: for example, GSM 03.38 defines only the default
    alphabet and UCS-2, not UTF-8. What is the point of a GSM gateway accepting
    surrogates or not? Why bother? They will not be shown on people's cell
    phones anyway, because of the GSM protocol.
    3. The definition of APIs: for example, in String you have

      int indexOf(int ch)
              Returns the index within this string of the first occurrence of the
              specified character.

    If the string a is "b" + a surrogate pair + "c" and I call a.indexOf("c"),
    what should it return, 2 or 3? If the caller then calls a.charAt(2), what
    should I return: the low surrogate, or the "c"?

      char charAt(int index)
              Returns the character at the specified index.

    How can I return the whole surrogate pair if someone calls a.charAt(1)? Or
    should I just return the high surrogate?

      String substring(int beginIndex)
              Returns a new string that is a substring of this string.

    What should we return if someone calls a.substring(2)? The low surrogate and
    the "c"? The high surrogate plus the low surrogate plus the "c"? An error?
    What happens if the software originally did not return an error code for
    substring and there is no exception model to invoke?

    4. Memory and performance trade-offs.
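
    The questions in item 3 can be walked through in Python for the string
    "b" + surrogate pair + "c". The sketch assumes one possible answer, indexing
    by UTF-16 code unit, and adds a hypothetical splits_pair check so that a
    substring call can refuse to start in the middle of a pair. None of these
    names comes from a real library, and raising an error is only one of the
    options the questions above leave open.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def splits_pair(units, index):
    """True if index falls between the two halves of a surrogate pair."""
    return (0 < index < len(units)
            and is_high_surrogate(units[index - 1])
            and is_low_surrogate(units[index]))

def substring_units(units, begin):
    """substring(begin) in code units; refuses to start inside a pair."""
    if splits_pair(units, begin):
        raise ValueError("beginIndex splits a surrogate pair")
    return units[begin:]

a = utf16_units("b\U00010400c")   # [0x0062, 0xD801, 0xDC00, 0x0063]
index_of_c = a.index(ord("c"))    # code-unit index of "c"
char_at_1 = a[1]                  # high surrogate only
char_at_2 = a[2]                  # low surrogate only
```

    With code-unit indexing, indexOf gives 3, charAt(1) and charAt(2) each give
    half a character, and substring(2) would start with a lone low surrogate
    unless the API is changed to report an error.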

    You can probably get a sense of the difficulty if you look at how many
    specification changes Microsoft needed to make to add surrogate support to
    OpenType fonts. And that is just the specification changes, not including
    code changes or API changes.


    It is easy to add surrogate support to your application if your application
    does nothing. It is difficult (not impossible) to add surrogate support if
    your application does some data processing. It is hard to add surrogate
    support if your software is a library with a previously defined API.

    Look at
     Format 4: Segment mapping to delta values
     Supporting 4-byte character codes

    I am not saying software should not support surrogates. I am saying: don't
    underestimate the effort. And when a piece of software does support
    surrogates correctly, give it some praise instead of taking it for granted.
    It is hard work.

    -Doug Ewell
    Fullerton, California

    Frank Yung-Fong Tang
    System Architect, Itrntinl Dvlpmet, AOL Intrtv Srvies
    AIM:yungfongta Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan

    John 3:16 "For God so loved the world that he gave his one and only Son, that
    whoever believes in him shall not perish but have eternal life.


    This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 20:40:36 EST