RE: [question] UTF-8 issue

From: John (Eljay) Love-Jensen (eljay@adobe.com)
Date: Fri Oct 09 2009 - 11:17:42 CDT


    Hi Chat,

    [Replying to the forum this time, with a detailed response.]

    Chat> I understand that in UTF-8 encoding, Unicode characters can be represented in more than one way.

    I believe you are conflating two different issues. But maybe I'm mistaken.

    In UTF-8 encoding, a given Unicode code point can be represented in exactly one way.

    In Unicode, many Unicode characters can be represented in either their decomposed form (NFD) or their pre-composed form (NFC).

    UTF-8 encoding will encode NFD, or NFC, or a hybrid of the two without problem.

    UTF-8 encoding does not guarantee normalization of the Unicode.

    The UTF-8 encoding and the Unicode normalization are issues at entirely different protocol layers.
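    To illustrate the layering, here is a minimal Python sketch (using the standard unicodedata module): the NFC and NFD forms of "é" both encode to perfectly valid UTF-8, just to different byte sequences.

```python
import unicodedata

# U+00E9 (pre-composed form, NFC) vs. "e" + U+0301 COMBINING ACUTE ACCENT (NFD).
nfc = "\u00e9"
nfd = unicodedata.normalize("NFD", nfc)

# Both are legal UTF-8; UTF-8 does not care which normalization form it carries.
nfc_bytes = nfc.encode("utf-8")   # C3 A9
nfd_bytes = nfd.encode("utf-8")   # 65 CC 81
```

    The two byte sequences are different, yet every UTF-8 decoder accepts both; deciding between NFC and NFD is the application's job, not the codec's.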

    Chat> Like for the US ASCII characters, it can be represented as "shortest form" and "non-shortest form".

    I am a little confused here. ASCII itself, which is a very old 7-bit standard, has only the one 7-bit representation. (If the question is about an ASCII character being encoded in UTF-8 with more bytes than necessary, that is the illegal "non-shortest form" discussed below.)

    Chat> With these issues, java1.6.0_11 changed the UTF-8 charset implementation to disregard the "non-shortest form".

    I'm out of the Java loop, having last used it with Java 1.5.

    I'm not sure whether "non-shortest form" in this context refers to (malformed) UTF-8 encoding, or to Unicode normalization (NFC, NFD, or denormalized*).

    * where "denormalized" means either that the Unicode stream could be a hybrid of decomposed Unicode sequences and pre-composed Unicode characters (and as such cannot yet be trusted to be NFC or NFD), -OR- that it has not yet been conformed to NFC or NFD (depending on the application's needs).

    Someone with Java 1.6 (0_11) expertise may be able to provide further assistance.

    Chat> How does UTF-8 identify that a byte sequence is illegal?

    UTF-8 has four ranges of Unicode code points:

    U+000000 ... U+00007F (bit pattern: xxxxxxx)
    UTF-8 encodes as 1 byte, from 0x00 to 0x7F, with the bit pattern 0xxxxxxx

    U+000080 ... U+0007FF (bit pattern: yyyyyxxxxxx)
    UTF-8 encodes as 2 bytes, the lead byte having 110yyyyy set, and the trailing byte having 10xxxxxx.

    U+000800 ... U+00FFFF (bit pattern: zzzzyyyyyyxxxxxx)
    UTF-8 encodes as 3 bytes, the lead byte having 1110zzzz set, and the trailing two bytes having 10yyyyyy and 10xxxxxx.

    U+010000 ... U+10FFFF (bit pattern: wwwzzzzzzyyyyyyxxxxxx)
    UTF-8 encodes as 4 bytes, the lead byte having 11110www set, and the trailing three bytes having 10zzzzzz, 10yyyyyy, 10xxxxxx.
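    The four ranges above can be turned directly into a toy encoder. This is just an illustrative Python sketch (the function name utf8_encode is mine, not any library's); real encoders must also reject surrogates and out-of-range input, which this does as well.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the four-range table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110yyyyy 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:      # 3 bytes: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    #                       4 bytes: 11110www 10zzzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
```

    Because each branch fires only for code points too large for the previous branch, the sketch can only ever produce shortest-form output.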

    Some interesting points in the UTF-8 encoding process:

    You will NEVER see (C0) or (C1) in a UTF-8 data stream. (C0) or (C1) could only occur as the lead byte of a two-byte sequence for a value in the U+000000 to U+00007F range, and that would be non-shortest form.

    UTF-8's four-byte structure can handle 0x000000 through 0x1FFFFF, which is 21 bits of data. However, Unicode runs only from U+000000 through U+10FFFF.

    As such, these bytes will NEVER show up in a conforming UTF-8 stream:
    (C0) - lead byte for U+000000 - U+00003F? Never happen. Non-shortest form.
    (C1) - lead byte for U+000040 - U+00007F? Never happen. Non-shortest form.
    (F5) - lead byte for U+140000 - U+17FFFF? Out of range. Max is U+10FFFF.
    (F6) - lead byte for U+180000 - U+1BFFFF? Out of range. Max is U+10FFFF.
    (F7) - lead byte for U+1C0000 - U+1FFFFF? Out of range. Max is U+10FFFF.
    (F8) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
    (F9) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
    (FA) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
    (FB) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
    (FC) - UTF-8 does not have a "6 byte" sequence for 31-bit encoding. Never happen.
    (FD) - UTF-8 does not have a "6 byte" sequence for 31-bit encoding. Never happen.
    (FE) - Illegal UTF-8 character. Never happen.
    (FF) - Illegal UTF-8 character. Never happen.

    Note: (F8)-(FD) may show up in the ISO 10646 "UTF-8-like" encoding of its 31-bit code space.
    ISO 10646 encoding is not Unicode. "UTF-8-like" is not UTF-8.
    Unicode has a 21-bit code space, and so (F8)-(FD) will not appear in conforming UTF-8.
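    As a quick check (a Python sketch; CPython's built-in UTF-8 decoder is strict), every byte in the list above is rejected even when followed by plausible-looking trailing bytes:

```python
# Bytes that can never appear in conforming UTF-8: C0, C1, and F5-FF.
never = bytes([0xC0, 0xC1]) + bytes(range(0xF5, 0x100))

rejected = []
for lead in never:
    try:
        # Pad with trailing-style bytes (10xxxxxx); the lead byte is still illegal.
        bytes([lead, 0xA0, 0xA0, 0xA0]).decode("utf-8")
    except UnicodeDecodeError:
        rejected.append(lead)
```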

    The UTF-8 parser will (or at least should) detect the use of non-shortest form UTF-8 as an error.

    The UTF-8 parser will detect a lead byte followed by the wrong number of trailing bytes as an error. (A trailing byte has the bit pattern 10xxxxxx, where xxxxxx is six bits of payload data.)

    The UTF-8 parser will detect a lead byte followed by an end-of-stream (instead of the required number of trailing bytes) as an error.

    The UTF-8 parser will detect a surrogate U+00D800 ... U+00DFFF as an error. Surrogates should not appear in a UTF-8 stream.
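    Each of those error cases can be demonstrated against a strict decoder (a Python sketch; CPython's UTF-8 codec rejects all of them):

```python
# Malformed streams matching the error cases above; a strict decoder
# raises on every one of them.
bad_streams = [
    b"\xc3",          # lead byte, then end-of-stream
    b"\xe2\x82",      # 3-byte lead with only one trailing byte
    b"\xc3\x28",      # second byte lacks the 10xxxxxx trailing pattern
    b"\xed\xa0\x80",  # surrogate U+00D800 encoded directly
]

caught = 0
for stream in bad_streams:
    try:
        stream.decode("utf-8")
    except UnicodeDecodeError:
        caught += 1
```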

    NOTE: in the above, I'm using "byte" and "UTF-8 code unit" rather interchangeably.

    Chat> That the sequence is in the non-shortest form?

    In the context of UTF-8, non-shortest form of U+000020 SPACE is any of:
    C0 A0
    E0 80 A0
    F0 80 80 A0

    A UTF-8 parser will (or should) detect all of these as errors, and a compliant UTF-8 encoder should never generate or emit them.
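    A strict decoder's behavior on those three sequences can be checked directly (Python sketch):

```python
# The three non-shortest-form (overlong) encodings of U+0020 SPACE listed above.
overlongs = [b"\xc0\xa0", b"\xe0\x80\xa0", b"\xf0\x80\x80\xa0"]

results = []
for seq in overlongs:
    try:
        seq.decode("utf-8")
        results.append("accepted")   # would indicate a buggy decoder
    except UnicodeDecodeError:
        results.append("rejected")
```

    The only conforming encoding of U+0020 is the single byte 0x20.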

    Chat> Who/how does "non-shortest form" be encoded?

    In the context of "non-shortest form" (i.e., illegal) UTF-8 encoding, by a UTF-8 encoder with bugs.

    In the context of "non-shortest form" (i.e., NFD) Unicode normalization, by the application that normalizes its Unicode strings to NFD.

    Chat> For xml files for example, who transforms these characters into bytes which in turn could turn into "shortest" or "non-shortest"?

    In the context of UTF-8, the UTF-8 encoder which generated the XML.

    In the context of Unicode normalization, at a (much) higher level than the UTF-8 encoder.

    Chat> When a program reads from an xml file with UTF-8 encoding, is it possible that the byte decoded is in "non-shortest form?"

    In the context of UTF-8, only if the UTF-8 encoding stream is malformed.

    In the context of Unicode normalization, that's outside the problem domain of UTF-8.

    Sincerely,
    --Eljay



    This archive was generated by hypermail 2.1.5 : Fri Oct 09 2009 - 11:20:07 CDT