Re: UTF-16 clarification needed

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 07 2008 - 15:02:04 CDT


    Doug Ewell responded:

    > But they do exist as ***code points***. TUS is clear there too, in
    > definitions D9 and D10.

    Correct.

    >
    > I'd like to wait for Ken or Mark or somebody to issue a bull on this.

    *hehe*

    > I
    > think I gave the correct answer to the question Jeroen asked, and you
    > are giving the correct answer for the question you think Jeroen really
    > meant to ask.

    I think Addison answered the question well, actually. There isn't
    a whole lot to add to that, but I'll maunder on, anyway...

    Jeroen's follow-up question was:

    > OK, and when you have them together in a surrogate pair, do you call it a
    > pair of code units or can you also call them a pair of code points?

    The way to think about this clearly is to specify the *context* in
    which you "have a surrogate pair".

    If you are talking about a UTF-16 string, then what that string consists
    of (if well-formed) is a sequence of UTF-16 code *units*. In that
    context:

        <0041 D840 DC45 0041>
        
    is a sequence of 4 UTF-16 code units, two of which constitute a
    well-formed surrogate pair.

    That UTF-16 string can be *interpreted* as a sequence of Unicode
    code points, namely:

        <0041, 20045, 0041>
        
    and from the standard, we know that the code point value 0041
    (or U+0041) represents LATIN CAPITAL LETTER A and the code point
    value 20045 (or U+20045) represents CJK UNIFIED IDEOGRAPH-20045.
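
    As an illustration of that interpretation step, here is a minimal
    Python sketch. The function name decode_utf16_units is hypothetical,
    not something from the standard; the arithmetic is the usual
    surrogate-pair calculation for UTF-16.

```python
def decode_utf16_units(units):
    """Interpret a sequence of 16-bit code units as Unicode code points.

    Hypothetical helper for illustration; raises ValueError on
    ill-formed input (an unpaired surrogate code unit).
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                code_points.append(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                )
                i += 2
                continue
            raise ValueError(f"unpaired high surrogate {unit:04X} at index {i}")
        if 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate {unit:04X} at index {i}")
        # Any other code unit stands for the code point of the same value.
        code_points.append(unit)
        i += 1
    return code_points


units = [0x0041, 0xD840, 0xDC45, 0x0041]
code_points = decode_utf16_units(units)
print([f"{cp:04X}" for cp in code_points])
```

    Four code units go in and three code points come out: the pair
    <D840 DC45> is consumed as a single unit of meaning.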

    In the context of your UTF-16 string, by the definition of UTF-16
    (D91), the isolated code unit value of D840 by itself can*not*
    be interpreted as a Unicode code point -- it is only part of the
    surrogate pair, i.e. part of a sequence of bits that together
    represent U+20045.

    However, outside of the context of your UTF-16 string, and
    considered in the context of the architecture of the overall
    standard, U+D840 certainly *is* a code point. It has a designated
    function, but that function requires that it never be assigned
    an abstract character.
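
    One way to see this distinction in practice, assuming Python 3 and
    its standard unicodedata module: the language lets you hold U+D840
    as a code point and reports its general category as Cs (Surrogate),
    but refuses to encode it as UTF-8. The helper can_encode_utf8 is a
    hypothetical name used only for this sketch.

```python
import unicodedata

# U+D840 exists as a code point and has a general category.
surrogate = chr(0xD840)
category = unicodedata.category(surrogate)  # 'Cs' (Surrogate)


def can_encode_utf8(text):
    """Return True if text can be serialized as UTF-8 (hypothetical helper)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(category)
print(can_encode_utf8(surrogate))      # the lone surrogate code point
print(can_encode_utf8(chr(0x20045)))   # an ordinary supplementary code point
```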

    When you are talking about UTF-16 strings, however, you are
    best off simply ignoring the status of U+D840 in the overall
    architecture. In UTF-16, D840 by itself is no more meaningful than
    would be a BF byte value by itself in UTF-8.
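
    The parallel can be checked with Python's built-in codecs, as a
    rough sketch: a lone D840 code unit is rejected by the UTF-16
    decoder just as a lone BF byte is rejected by the UTF-8 decoder.
    The helper is_well_formed is a hypothetical name for illustration.

```python
def is_well_formed(data, encoding):
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A single D840 code unit, serialized big-endian.
lone_surrogate_unit = (0xD840).to_bytes(2, "big")
# A single UTF-8 continuation byte with no lead byte.
lone_continuation_byte = bytes([0xBF])
# The full string <0041 D840 DC45 0041>, serialized big-endian.
full_string = b"".join(
    u.to_bytes(2, "big") for u in (0x0041, 0xD840, 0xDC45, 0x0041)
)

print(is_well_formed(lone_surrogate_unit, "utf-16-be"))
print(is_well_formed(lone_continuation_byte, "utf-8"))
print(is_well_formed(full_string, "utf-16-be"))
```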

    --Ken


