Re: Getting A Newb Started

From: Doug Ewell (dewell@roadrunner.com)
Date: Mon Jul 07 2008 - 20:57:36 CDT

    John H. Jenkins <jenkins at apple dot com> wrote:

    >> ... If you use UTF-32, every char is four bytes. If you use UTF-8,
    >> characters take from one to four bytes depending on where the
    >> corresponding codepoint is. If you use UTF-16, every character in the
    >> BMP is two bytes, any character outside of the BMP takes four bytes.
    >
    > This isn't as much of an advantage as it sounds, since in most Unicode
    > processes you need to be prepared to deal with multiple characters at
    > once anyway.

    I hear this argument every so often, from different people, and it
    never carries any weight with me. Sure, there are plenty of
    text-processing situations (Unicode or otherwise) where you need to
    deal with more than one character at a time -- especially so with
    Unicode, with its combining marks and such. But there are still many
    other string-processing situations that require functions like Length
    and IndexOf and Remove. The need to do that kind of
    character-by-character processing hasn't vanished. Just last week I
    wrote a program that operated on a fixed-column-width UTF-8 file, and to
    do that, you have to deal with characters by position.
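
    A minimal sketch of that kind of positional access, in Python rather
    than the C#/.NET mentioned in this message. The record layout (two
    10-column fields) and the name slice_columns are assumptions made up
    for illustration; they are not taken from the program described
    above. Columns are counted in code points, so a combining mark would
    still occupy a column of its own.

```python
def slice_columns(line_bytes: bytes, start: int, end: int) -> str:
    """Return columns [start, end) of a UTF-8 line, counted in code points."""
    # Decode first: Python str indexes by code point, not by byte.
    text = line_bytes.decode("utf-8")
    return text[start:end]


# Hypothetical fixed-width record: two fields of 10 columns each.
record = ("Zo\u00eb" + " " * 7 + "M\u00fcller" + " " * 4).encode("utf-8")

first = slice_columns(record, 0, 10).rstrip()
last = slice_columns(record, 10, 20).rstrip()

# Slicing the raw bytes instead drifts, because U+00EB takes two bytes.
by_bytes = record[10:20].decode("utf-8")
```

    Here first and last come out as the two names, while by_bytes starts
    one column early because the byte offsets no longer line up with the
    character columns.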

    >> The downside of UTF-16 and UTF-8 is that characters are not the same
    >> length, which makes processing more complicated. With UTF-16,
    >> however, if you know that there are no characters outside the BMP,
    >> every character is a constant two bytes wide.
    >
    > That's the problem. You really can't make the assumption that you're
    > dealing with BMP-only text.

    Agreed. UTN #12 notwithstanding, I'm with Bill Poser in preferring to
    store Unicode text in memory as UTF-32 -- when I have to do it manually
    at all, which is less and less often, as I complete the transition from
    C++ and MFC to C# and .NET.
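
    The size trade-offs discussed in this thread can be checked with a
    short Python sketch. The helper names encoded_sizes and is_bmp_only
    are hypothetical, and the little-endian codec variants are used only
    so that no byte order mark is counted in the lengths.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one character in each Unicode encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le: no BOM prepended
        "utf-32": len(ch.encode("utf-32-le")),
    }


def is_bmp_only(text: str) -> bool:
    """True if no code point lies above U+FFFF (no surrogate pairs needed)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


ascii_a = encoded_sizes("A")           # U+0041
euro = encoded_sizes("\u20ac")         # U+20AC, inside the BMP
g_clef = encoded_sizes("\U0001D11E")   # U+1D11E, outside the BMP
```

    Only the UTF-32 column stays constant across all three characters;
    the UTF-16 width is constant only for text where is_bmp_only holds,
    which is exactly the assumption being questioned above.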

    --
    Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

