From: Chat S. Depasucat (firstname.lastname@example.org)
Date: Thu Oct 08 2009 - 05:03:14 CDT
I'm really thankful to have found this mailing list.
I have a few UTF-8 issues that I hope somebody can shed some light on:
I understand that in the UTF-8 encoding, a Unicode character can be
represented by more than one byte sequence.
For example, a US-ASCII character can be encoded in "shortest form" or
in "non-shortest form".
Because of these issues, Java 1.6.0_11 changed its UTF-8 charset
implementation so that it no longer accepts the "non-shortest form".
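To illustrate what I mean, here is a minimal sketch. The class name
NonShortestFormDemo and the helper decodeStrict are made up for this
example, and I am assuming a JDK whose UTF-8 decoder rejects non-shortest
forms. It feeds the decoder the one-byte form of '/' (U+002F) and a
two-byte sequence, 0xC0 0xAF, that carries the same code point bits.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class, not part of any library.
class NonShortestFormDemo {

    // Decodes the bytes as UTF-8 and reports errors instead of silently
    // substituting a replacement character. Returns null if the decoder
    // considers the input malformed.
    static String decodeStrict(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // U+002F '/' in shortest form: a single byte.
        byte[] shortest = { 0x2F };
        // The same code point padded into two bytes (110 00000, 10 101111).
        byte[] nonShortest = { (byte) 0xC0, (byte) 0xAF };

        System.out.println(decodeStrict(shortest));
        // Assumption: a decoder that rejects non-shortest forms gives null here.
        System.out.println(decodeStrict(nonShortest));
    }
}
```

With the default String constructor the malformed bytes would be replaced
rather than reported, so the sketch uses a CharsetDecoder set to REPORT to
make the rejection visible.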
Here are my questions:
1. How does a UTF-8 decoder identify that a byte sequence is illegal,
that is, that the sequence is in non-shortest form?
2. Who produces the "non-shortest form", and how does it get encoded?
For XML files, for example, who transforms the characters into
bytes, which in turn could end up in "shortest" or "non-shortest" form?
When a program reads an XML file with UTF-8 encoding, is it
possible that the bytes being decoded are in "non-shortest form"?
I hope somebody can help me understand this.
Thanks so much.