Re: UTF-16 inside UTF-8

From: Doug Ewell
Date: Wed Nov 05 2003 - 18:38:49 EST


    Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

    >> I don't know about the relative market needs. I think supplementary
    >> character support is important because these characters are part of
    >> Unicode just as much as BMP characters are,
    > And don't forget, many people think some of the BMP characters are not
    > important for their software. And that is probably the exact reason
    > why MES-1, MES-2 and MES-3 got created. For Front-End software, it is
    > quite difficult, or I should say impossible to support even the whole
    > BMP. I currently see NO front-end software 100% support the whole
    > Unicode BMP correctly from input to rendering. Name me one and I can
    > tell you what they didn't do.

    Topic-change alert! I'm not talking about glyph support in fonts, or
    bidi support, or collation, or contextual shaping, or any other aspect
    of Unicode support. I'm talking about completely denying the existence
    of non-BMP characters.

    There are tons of applications -- Notepad is a basic example -- that
    allow the entry of any arbitrary BMP character. They don't allow some
    BMP characters and disallow others. That's all I'm talking about. Now,
    if such an application allows BMP characters but disallows supplementary
    characters, as MySQL (e.g.) does, I think that is an unnecessary
    restriction.

    One of these days I'm going to implement a "Unicode" front end that
    supports Basic Latin and U+A068 YI SYLLABLE BBOP, but *no other
    characters*, just to show how silly such a restriction would be.
    (Remember, it's conformant as long as I don't lie about it. That
    doesn't mean it's not silly.)

    > For back end software which do pure data process without keyboard
    > input or text rendering, it is easier to implement the whole Unicode
    > BMP range or even with the surrogate.

    (1) "Surrogates" are only about UTF-16, not any other aspect of
        Unicode.
    (2) Supporting surrogates in UTF-16 is not tremendously difficult.
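
    To make point (2) concrete, here is a minimal sketch of the UTF-16
    surrogate arithmetic. The function names are invented for this
    illustration and are not taken from any particular library.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 (high, low) surrogate pair. Hypothetical helper name."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                 # 20-bit offset into the supplementary range
    high = 0xD800 | (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a well-formed surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

    Code that never uses UTF-16 never needs either function.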

    >> and implementing UTF-8 support for the entire Unicode code space is
    >> about 0.1% harder than artificially crippling it by restricting it to
    >> the BMP.
    > Disagree about what you said "about 0.1 % harder".
    > For many developers, adding 4 bytes UTF-8 to surrogate support simply
    > mean open a can of worm.

    See point (1) above.
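
    As a rough illustration of how little the 4-byte case adds, here is a
    sketch of a hand-written UTF-8 encoder. The name utf8_encode is
    hypothetical; the point is only that a BMP-only version stops after
    the third branch, and full coverage adds the final return statement.
    No surrogates are involved anywhere.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                    # 1 byte: U+0000..U+007F
        return bytes([cp])
    if cp < 0x800:                   # 2 bytes: U+0080..U+07FF
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                 # 3 bytes: rest of the BMP
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: supplementary characters, U+10000..U+10FFFF
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```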

    > After that, they need to worry about how to
    > support surrogate, which is quite complex in the api design/change.

    See points (1) and (2) above.

    > The work to make the converter convert UTF-8 to a surrogate pair and
    > back is probably as you said "0.1 harder". But work AFTER they open
    > such door is much harder to manage. As the famous saying "Unicode is
    > not the answer for Internationalization, Unicode is the question for
    > the Internationalization". Thanks for all the job opportunity Unicode
    > standard created (and keep creating) of us :)

    See point (1) above. Other than UTF-16 surrogates -- and remember, this
    is not 1993; the world of Unicode no longer revolves around the 16-bit
    encoding form -- what aspect of supplementary character support is so
    much more complicated than BMP support?
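
    For what it's worth, here is a sketch of decoding a 4-byte UTF-8
    sequence straight to a code point (decode_utf8_4byte is a made-up
    name for this example). The result is an ordinary integer; surrogates
    would enter the picture only if that value were later re-encoded as
    UTF-16.

```python
def decode_utf8_4byte(b):
    """Decode exactly one 4-byte UTF-8 sequence to a code point (sketch)."""
    if (len(b) != 4
            or (b[0] & 0xF8) != 0xF0                     # lead byte 11110xxx
            or any((x & 0xC0) != 0x80 for x in b[1:])):  # trail bytes 10xxxxxx
        raise ValueError("not a well-formed 4-byte UTF-8 sequence")
    cp = (((b[0] & 0x07) << 18) | ((b[1] & 0x3F) << 12)
          | ((b[2] & 0x3F) << 6) | (b[3] & 0x3F))
    if not 0x10000 <= cp <= 0x10FFFF:                    # overlong or out of range
        raise ValueError("overlong form or value beyond U+10FFFF")
    return cp
```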

    -Doug Ewell
    Fullerton, California
