[question] UTF-8 issue

From: Chat S. Depasucat (cdepasucat@ntsp.nec.co.jp)
Date: Thu Oct 08 2009 - 05:03:14 CDT

Next message: Rick McGowan: "Re: Unicode Haiku Contest"

Previous message: Marion Gunn: "Re: [Unicode Announcement] Unicode Haiku Contest"
Next in thread: Michael D'Errico: "Re: [question] UTF-8 issue"
Reply: Michael D'Errico: "Re: [question] UTF-8 issue"
Reply: John (Eljay) Love-Jensen: "RE: [question] UTF-8 issue"

I'm really thankful that I found this mailing list.

I have a few UTF-8 issues that I hope somebody can shed light on:

I understand that in the UTF-8 encoding, a Unicode character can be
written as more than one byte sequence.
US-ASCII characters, for example, have a "shortest form" (a single byte)
and "non-shortest forms" (longer, multi-byte sequences).
Because of this, Java 1.6.0_11 changed its UTF-8 charset implementation
to no longer accept the "non-shortest form".
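
To make the two forms concrete, here is a minimal sketch, assuming a modern JDK (Java 7 or later, for StandardCharsets) rather than the 1.6.0_11 release mentioned above. The class name OverlongDemo and its helper methods are hypothetical, chosen only for this illustration. It decodes "/" (U+002F) from its shortest form, the single byte 0x2F, and from a two-byte non-shortest form, 0xC0 0xAF, using a decoder configured to report malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
class OverlongDemo {

    // Decode strictly: malformed input raises an exception
    // instead of being replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // True if the strict decoder accepts the whole byte sequence.
    static boolean isAccepted(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] shortest = { 0x2F };                     // "/" in shortest form
        byte[] overlong = { (byte) 0xC0, (byte) 0xAF }; // "/" in a 2-byte non-shortest form

        System.out.println("shortest accepted: " + isAccepted(shortest));
        System.out.println("overlong accepted: " + isAccepted(overlong));
    }
}
```

On a current JDK the first sequence should be accepted and the second rejected as malformed. Note that `new String(bytes, "UTF-8")` does not throw on malformed input; it substitutes a replacement character, which is why the sketch uses a CharsetDecoder with REPORT.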

Here are my questions:
1. How does a UTF-8 decoder identify that a byte sequence is illegal,
that is, that the sequence is in the non-shortest form?
2. Who or what produces the "non-shortest form" when encoding?
    For XML files, for example, what transforms the characters into
bytes, which could in turn come out as "shortest" or "non-shortest"?

   When a program reads an XML file with UTF-8 encoding, is it
possible that the bytes being decoded are in "non-shortest form"?
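
As an illustration of what question 1 is asking about, the following is a rough sketch of how a non-shortest form can be recognized from the first two bytes of a sequence. It assumes the well-formed byte ranges given in the Unicode Standard; the class ShortestFormCheck and the method startsOverlong are hypothetical names, and this is not taken from the Java implementation. It only detects the non-shortest case and does not perform full UTF-8 validation.

```java
// Hypothetical helper; a sketch, not a complete UTF-8 validator.
class ShortestFormCheck {

    // Returns true if the sequence starting at bytes[i] is a
    // non-shortest (overlong) form, judged by its first two bytes.
    static boolean startsOverlong(byte[] bytes, int i) {
        int b0 = bytes[i] & 0xFF;

        // 0xC0 and 0xC1 could only lead a 2-byte form of a value
        // below U+0080, which already fits in one byte.
        if (b0 == 0xC0 || b0 == 0xC1) {
            return true;
        }
        if (i + 1 >= bytes.length) {
            return false; // truncated: nothing more to judge here
        }
        int b1 = bytes[i + 1] & 0xFF;
        if ((b1 & 0xC0) != 0x80) {
            return false; // not a continuation byte: malformed, but not overlong
        }
        // 0xE0 followed by 0x80..0x9F encodes a value below U+0800,
        // which fits in two bytes.
        if (b0 == 0xE0 && b1 < 0xA0) {
            return true;
        }
        // 0xF0 followed by 0x80..0x8F encodes a value below U+10000,
        // which fits in three bytes.
        if (b0 == 0xF0 && b1 < 0x90) {
            return true;
        }
        return false;
    }
}
```

For example, `startsOverlong(new byte[] { (byte) 0xE0, (byte) 0x80, (byte) 0xAF }, 0)` is true (a three-byte form of U+002F), while the same call on `{ (byte) 0xE2, (byte) 0x82, (byte) 0xAC }` (U+20AC) is false.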

I hope somebody can help me understand this.
Thanks so much.


This archive was generated by hypermail 2.1.5 : Thu Oct 08 2009 - 10:49:54 CDT