From: John (Eljay) Love-Jensen (firstname.lastname@example.org)
Date: Fri Oct 09 2009 - 11:17:42 CDT
[Replying to the forum this time, with a detailed response.]
Chat> I understand that in UTF-8 encoding, Unicode characters can be represented in more than one way.
I believe you are conflating two different issues. But maybe I'm mistaken.
In UTF-8 encoding, a Unicode code point can only be represented in one-and-exactly-one way.
In Unicode, many Unicode characters can be represented in either their decomposed form (NFD) or their pre-composed form (NFC).
UTF-8 encoding will encode NFD, or NFC, or a hybrid of the two without problem.
UTF-8 encoding does not guarantee normalization of the Unicode.
The UTF-8 encoding and the Unicode normalization are issues at entirely different protocol layers.
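To make the layering concrete, here is a minimal sketch (using Python's standard `unicodedata` module as the illustration language; the point is language-independent). The same character in NFC and NFD is two different code point sequences, and UTF-8 faithfully encodes either one without normalizing:

```python
import unicodedata

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, pre-composed (NFC)
nfc = "\u00e9"
# Decomposed (NFD): U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT
nfd = unicodedata.normalize("NFD", nfc)

# Both are legal Unicode; UTF-8 encodes each without complaint,
# yielding different (each perfectly well-formed) byte sequences:
print(nfc.encode("utf-8"))  # 2 bytes: C3 A9
print(nfd.encode("utf-8"))  # 3 bytes: 65 CC 81

# UTF-8 round-trips both faithfully -- it does not normalize:
assert nfc.encode("utf-8").decode("utf-8") == nfc
assert nfd.encode("utf-8").decode("utf-8") == nfd
assert nfc != nfd  # normalization is a separate, higher-level step
```

Normalizing to NFC or NFD is something the application does before (or after) UTF-8 ever sees the text.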
Chat> Like for the US ASCII characters, it can be represented as "shortest form" and "non-shortest form".
I am confused. ASCII is a very old 7-bit standard; I'm not aware of it having any form other than its 7-bit representation. (Perhaps you mean that ASCII-range code points can be illegally encoded in UTF-8 using multi-byte "non-shortest form" sequences? More on that below.)
Chat> With these issues, java1.6.0_11 changed the UTF-8 charset implementation to disregard the "non-shortest form".
I'm out of the Java loop, having last used it with Java 1.5.
I'm not sure if "non-shortest form" in this context refers to (malformed) UTF-8 encoding, or to Unicode normalization (NFC, NFD, or denormalized*).
* where "denormalized" means either that the Unicode stream is a hybrid of decomposed sequences and pre-composed characters (and so cannot yet be trusted to be NFC or NFD), -OR- that it has not yet been conformed to NFC or NFD (depending on the application's needs).
Someone with Java 1.6 (0_11) expertise may be able to provide further assistance.
Chat> How does UTF-8 identify that a byte sequence is illegal?
UTF-8 has four ranges of Unicode code points:
U+000000 ... U+00007F (bit pattern: xxxxxxx)
UTF-8 encodes as 1 byte, from 0x00 to 0x7F, with the bit pattern 0xxxxxxx
U+000080 ... U+0007FF (bit pattern: yyyyyxxxxxx)
UTF-8 encodes as 2 bytes, the lead byte having 110yyyyy set, and the trailing byte having 10xxxxxx.
U+000800 ... U+00FFFF (bit pattern: zzzzyyyyyyxxxxxx)
UTF-8 encodes as 3 bytes, the lead byte having 1110zzzz set, and the trailing two bytes having 10yyyyyy and 10xxxxxx.
U+010000 ... U+10FFFF (bit pattern: wwwzzzzzzyyyyyyxxxxxx)
UTF-8 encodes as 4 bytes, the lead byte having 11110www set, and the trailing three bytes having 10zzzzzz, 10yyyyyy, 10xxxxxx.
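The four ranges above can be checked empirically; a small sketch (again in Python, purely as an illustration language) encoding one sample code point from each range:

```python
# One example code point from each of the four UTF-8 ranges, with the
# expected encoded length from the table above:
examples = [
    ("\u0041", 1),      # U+0041  'A'        -> 0xxxxxxx
    ("\u00e9", 2),      # U+00E9  e-acute    -> 110yyyyy 10xxxxxx
    ("\u20ac", 3),      # U+20AC  EURO SIGN  -> 1110zzzz 10yyyyyy 10xxxxxx
    ("\U0001f600", 4),  # U+1F600 emoji      -> 11110www 10zzzzzz 10yyyyyy 10xxxxxx
]
for ch, nbytes in examples:
    encoded = ch.encode("utf-8")
    assert len(encoded) == nbytes
    print(f"U+{ord(ch):06X} -> {encoded.hex(' ').upper()}")
```

Every trailing byte prints in the range 80-BF (the 10xxxxxx pattern), which is what lets a decoder resynchronize at any byte boundary.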
Some interesting points in the UTF-8 encoding process:
You will NEVER see (C0) or (C1) in a UTF-8 data stream. (C0) and (C1) would occur as a lead byte for a value in the U+000000 to U+00007F range, but that would result in non-shortest form.
UTF-8 can handle 0x000000 through 0x1FFFFF, which is 21-bits of data. However, Unicode runs from U+000000 through U+10FFFF.
As such, these bytes will NEVER show up in a conforming UTF-8 stream:
(C0) - lead byte for U+000000 - U+00003F? Never happen. Non-shortest form.
(C1) - lead byte for U+000040 - U+00007F? Never happen. Non-shortest form.
(F5) - lead byte for U+140000 - U+17FFFF? Out of range. Max is U+10FFFF.
(F6) - lead byte for U+180000 - U+1BFFFF? Out of range. Max is U+10FFFF.
(F7) - lead byte for U+1C0000 - U+1FFFFF? Out of range. Max is U+10FFFF.
(F8) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
(F9) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
(FA) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
(FB) - UTF-8 does not have a "5 byte" sequence for 26-bit encoding. Never happen.
(FC) - UTF-8 does not have a "6 byte" sequence for 31-bit encoding. Never happen.
(FD) - UTF-8 does not have a "6 byte" sequence for 31-bit encoding. Never happen.
(FE) - Illegal UTF-8 character. Never happen.
(FF) - Illegal UTF-8 character. Never happen.
Note: (F8)-(FD) may show up for ISO 10646 "UTF-8-like" encoding of its 31-bit code space.
ISO 10646 encoding is not Unicode. "UTF-8-like" is not UTF-8.
Unicode has a 21-bit code space, and so (F8)-(FD) will not appear in conforming UTF-8.
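A strict decoder must reject every one of the byte values listed above on sight. A quick sketch using Python's built-in UTF-8 codec (my choice for illustration; the thread concerns Java, but the rejection behavior required of any conforming decoder is the same):

```python
# Byte values that can NEVER appear in conforming UTF-8:
# C0/C1 (non-shortest form), F5-F7 (beyond U+10FFFF),
# F8-FD (would start 5/6-byte sequences), FE/FF (not UTF-8 at all).
forbidden = [0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9,
             0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF]
for lead in forbidden:
    try:
        bytes([lead, 0x80]).decode("utf-8")  # hand it a trailing byte anyway
    except UnicodeDecodeError:
        pass                                  # rejected, as required
    else:
        raise AssertionError(f"0x{lead:02X} was wrongly accepted")
print("all 13 forbidden bytes rejected")
```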
The UTF-8 parser will (or at least should) detect use of a non-shortest form of UTF-8 as an error.
The UTF-8 parser will detect a lead byte followed by the wrong number of trailing bytes as an error. (A trailing byte has bits 10xxxxxx, where xxxxxx is payload data bits).
The UTF-8 parser will detect a lead byte followed by an end-of-stream (instead of the required number of trailing bytes) as an error.
The UTF-8 parser will detect a surrogate U+00D800 ... U+00DFFF as an error. Surrogates should not appear in a UTF-8 stream.
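Each of those error classes is easy to demonstrate; a sketch with Python's strict UTF-8 decoder (illustration only, but any conforming decoder must reject the same inputs):

```python
malformed = [
    b"\xc3\x28",      # lead byte followed by a non-trailing byte (0x28 is not 10xxxxxx)
    b"\xe2\x82",      # 3-byte lead, but the stream ends after one trailing byte
    b"\xed\xa0\x80",  # would decode to the surrogate U+00D800 -- forbidden in UTF-8
]
for seq in malformed:
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        pass  # rejected, as required
    else:
        raise AssertionError(f"{seq.hex(' ').upper()} was wrongly accepted")
print("all malformed sequences rejected")
```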
NOTE: in the above, I'm using "byte" and "UTF-8 encoding unit" rather interchangeably.
Chat> That the sequence is in the non-shortest form?
In the context of UTF-8, non-shortest form of U+000020 SPACE is any of:
C0 A0
E0 80 A0
F0 80 80 A0
All of which a UTF-8 parser will (or should) detect as an error, and which a compliant UTF-8 encoder will never generate/emit.
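To check this concretely (a Python sketch; Java's fixed decoder must behave the same way): the only legal encoding of SPACE is the single byte 20, and every padded-out multi-byte alternative is rejected.

```python
# The one legal UTF-8 encoding of U+0020 SPACE:
assert " ".encode("utf-8") == b"\x20"

# Non-shortest (overlong) encodings of SPACE, all illegal:
overlong_space = [b"\xc0\xa0", b"\xe0\x80\xa0", b"\xf0\x80\x80\xa0"]
for seq in overlong_space:
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        pass  # rejected, as a conforming decoder must
    else:
        raise AssertionError(f"{seq.hex(' ').upper()} was wrongly accepted")
print("all overlong encodings of SPACE rejected")
```

Rejecting overlong forms matters for security: otherwise a filter scanning for, say, "../" in raw bytes could be bypassed by an overlong encoding of the same characters.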
Chat> Who/how does "non-shortest form" be encoded?
In the context of "non-shortest form" (i.e., illegal) UTF-8 encoding, by a UTF-8 encoder with bugs.
In the context of "non-shortest form" (i.e., NFD) Unicode normalization, by the application that normalizes its Unicode strings to NFD.
Chat> For xml files for example, who transforms these characters into bytes which in turn could turn into "shortest" or "non-shortest"?
In the context of UTF-8, the UTF-8 encoder which generated the XML.
In the context of Unicode normalization, at a (much) higher level than the UTF-8 encoder.
Chat> When a program reads from an xml file with UTF-8 encoding, is it possible that the byte decoded is in "non-shortest form?"
In the context of UTF-8, only if the UTF-8 encoding stream is malformed.
In the context of Unicode normalization, that's outside the problem domain of UTF-8.
This archive was generated by hypermail 2.1.5 : Fri Oct 09 2009 - 11:20:07 CDT