Re: UTF-16 inside UTF-8

From: YTang0648@aol.com
Date: Wed Nov 05 2003 - 14:48:44 EST

    In a message dated 11/5/2003 11:15:44 AM Pacific Standard Time,
    dewell@adelphia.net writes:
    Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

    >> At the risk of upsetting the open-source faithful, that is just plain
    >> lazy.
    >
    > I don't think you should call it "lazy". It is just "under
    > construction" if such software is still in "alpha". How much
    > software has such support at the "alpha" stage in your company?

    My company is not the best example here; we're well behind the curve
    when it comes to Unicode, and i18n/L10n generally.

    That said, I think it would be much faster and less error-prone, for a
    company adding Unicode support to a product for the first time, to
    support the entire Unicode range from the outset, rather than supporting
    just the BMP in the alpha stage and then "adding" support for
    supplementary characters. For UTF-8 in particular, I can't imagine why
    one would choose to implement the 1-, 2-, and 3-byte forms in one stage
    and add the 4-byte forms in a later stage.

    If you ever move a software implementation from supporting only a
    single-byte charset to supporting full Unicode 4.0, then you will be
    able to imagine it, especially if the project has 20-100 people
    working on it who don't care much about Unicode or international
    support. I have worked on such projects for more than 10 years, and
    to me such a staging approach is very reasonable.

    The reason is very simple. Usually what happens is that the software
    needs to use something other than UTF-8 for internal processing. For
    example, Mozilla takes UTF-8 as input and converts it to UTF-16 for
    internal storage. One reason UTF-8 is not ideal for some internal
    processing is that for operations like "ToUpper" or "ToLower" (or
    collation, etc.) it is much easier to build a UCS-2 based
    case-mapping table than a UTF-8 based one.
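
    For example, with a fixed-width code unit the whole case table can be
    a simple two-level array lookup. Here is a minimal sketch in C (the
    names and table contents are just for illustration, not Mozilla's
    actual code):

    #include <stdint.h>

    typedef uint16_t UniChar;   /* one fixed-width UCS-2 code unit */

    /* Two-level lookup: 256 pages of 256 deltas. Each delta is added to
       the code unit to get its uppercase form; 0 means "no change". Only
       the ASCII page is filled in here; a real table is generated from
       the Unicode character database and covers every populated page. */
    static int16_t ascii_page[256];
    static const int16_t *toupper_pages[256] = { ascii_page };

    static void init_case_table(void)
    {
        for (int c = 'a'; c <= 'z'; c++)
            ascii_page[c] = 'A' - 'a';          /* delta of -32 */
    }

    UniChar uni_toupper(UniChar c)
    {
        const int16_t *page = toupper_pages[c >> 8];
        return page ? (UniChar)(c + page[c & 0xFF]) : c;
    }

    Nothing like that is possible with UTF-8 without decoding first,
    because the table index would be a variable number of bytes.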

    Because of this, software that processes text probably doesn't want
    to use UTF-8 internally. It is fine for software that just stores the
    data or passes it through to use UTF-8 as its internal format, but
    UTF-8 is not ideal as the internal format for software that processes
    the data.

    Then the next reason is that the software may have APIs which take or
    return a character index into a string. For example, if your software
    has APIs like the following:

    int TheFirstCharacterInTheString(String, Character), which returns
    the index of the first occurrence of the character in the string,
    or
    String TheLeftSubString(String, Length), which returns the leftmost
    "Length" characters,

    then UCS-2 or UCS-4 is easy to deal with, and UTF-8 or UTF-16 is much
    harder, because in UCS-2 or UCS-4 you can compute the memory
    requirement / byte offset from the character index, and vice versa,
    while in UTF-8 or UTF-16 you cannot. To return an index or a length,
    you basically need two sets of APIs: one that returns the number of
    "characters" and one that returns the "memory requirement", in case
    the caller needs to allocate memory first.

    Because of this, it is much easier to use UCS-2 or UCS-4 in the API,
    or probably I should say in the private interfaces inside the
    software. However, using UCS-4 doubles the memory requirement
    compared to UCS-2, which already doubles the memory requirement
    compared to single-byte-only support (which, for some software, means
    the last shipped version). Therefore, it is easier to move from
    supporting only single-byte encodings to UTF-8 support that covers
    only the 1- to 3-byte forms in the first version that moves to
    Unicode.

    I am not saying this is the ideal case or that they should do it this
    way. I am just telling you what people will face and think when they
    move from an ISO-8859-1-only implementation to a pure Unicode
    implementation. A lot of the time, they need to deal with one thing
    per step.

    Usually the staging approach is:
    1. Change the internal data type from char to some other data type,
    probably a typedef uniChar. If you ask for uniChar to be 4 bytes, you
    will hit a hard wall, die, and stop there. If you ask for it to be 2
    bytes, you will hit a wall, break both your head and the wall, and
    continue.
    2. Add converters to convert ISO-8859-1 and UTF-8 from/to that
    uniChar (see the sketch after this list).
    3. Migrate all the code.
    4. Talk to people about supporting UTF-16, or about changing uniChar
    to 4 bytes, after you have proved that steps 1-3 bring a lot of value
    and do not cause too many performance/footprint issues.
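
    To make steps 1 and 2 concrete, here is a minimal sketch in C (the
    names are illustrative, and the validation a real converter needs is
    left out):

    #include <stddef.h>
    #include <stdint.h>

    /* Step 1: the new internal code unit. 2 bytes, per the discussion
       above; widening it to 4 bytes is step 4. */
    typedef uint16_t uniChar;

    /* Step 2a: ISO-8859-1 is the easy converter. Every byte maps
       directly to the same code point, so it is a widening copy. */
    size_t latin1_to_unichar(const unsigned char *in, size_t len,
                             uniChar *out)
    {
        for (size_t i = 0; i < len; i++)
            out[i] = in[i];
        return len;                 /* one uniChar per input byte */
    }

    /* Step 2b: UTF-8, deliberately limited to the 1- to 3-byte forms,
       matching the staged approach above. Assumes well-formed input. */
    size_t utf8_to_unichar(const unsigned char *in, size_t len,
                           uniChar *out)
    {
        size_t n = 0, i = 0;
        while (i < len) {
            unsigned char b = in[i];
            if (b < 0x80) {                          /* 1-byte form */
                out[n++] = b;
                i += 1;
            } else if ((b & 0xE0) == 0xC0) {         /* 2-byte form */
                out[n++] = (uniChar)(((b & 0x1F) << 6)
                                     | (in[i+1] & 0x3F));
                i += 2;
            } else if ((b & 0xF0) == 0xE0) {         /* 3-byte form */
                out[n++] = (uniChar)(((b & 0x0F) << 12)
                                     | ((in[i+1] & 0x3F) << 6)
                                     | (in[i+2] & 0x3F));
                i += 3;
            } else {
                i += 1;    /* 4-byte form: not handled at this stage */
            }
        }
        return n;
    }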

    -Doug Ewell
    Fullerton, California
    http://users.adelphia.net/~dewell/

    ==================================
    Frank Yung-Fong Tang
    System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
    AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan

    John 3:16 "For God so loved the world that he gave his one and only
    Son, that whoever believes in him shall not perish but have eternal
    life."

    Does your software display Thai language text correctly for Thailand users?
    -> Basic Concept of Thai Language, linked from Frank Tang's
    Iñtërnâtiônàlizætiøn Secrets
    Want to translate your English text to something Thailand users can
    understand?
    -> Try English-to-Thai machine translation at
    http://c3po.links.nectec.or.th/parsit/
