[question] UTF-8 issue

From: Chat S. Depasucat (cdepasucat@ntsp.nec.co.jp)
Date: Thu Oct 08 2009 - 05:03:14 CDT


    I'm really thankful to have found this mailing list.

    I have a few UTF-8 questions that I hope somebody can shed light on:

    I understand that the UTF-8 bit patterns make it possible to write
    the same Unicode character as more than one byte sequence.
    A US-ASCII character, for example, can appear in its "shortest form"
    (a single byte) or in a "non-shortest form" (two or more bytes).
    Because of this, Java 1.6.0_11 changed its UTF-8 charset implementation
    so that it no longer accepts the "non-shortest form".
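
    To make the two forms concrete, here is a minimal Java sketch for
    'A' (U+0041). The class and helper names (OverlongDemo, twoByteForm,
    hex) are made up for illustration, and the two-byte packing is only
    meant to show what a "non-shortest form" looks like, not what any
    real encoder emits.

```java
class OverlongDemo {

    // Packs a code point below 0x80 into the two-byte pattern
    // 110xxxxx 10xxxxxx. For such code points this is a
    // non-shortest (overlong) form, shown here only for illustration.
    static byte[] twoByteForm(int codePoint) {
        return new byte[] {
            (byte) (0xC0 | (codePoint >> 6)),
            (byte) (0x80 | (codePoint & 0x3F))
        };
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(new byte[] {(byte) 'A'})); // 41    (shortest form)
        System.out.println(hex(twoByteForm('A')));        // C1 81 (non-shortest form)
    }
}
```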

    Here are my questions:
    1. How does a UTF-8 decoder identify that a byte sequence is illegal,
    that is, that the sequence is in the non-shortest form?
    2. Who or what produces the "non-shortest form", and how?
        For XML files, for example, what transforms the characters into
    bytes, and what determines whether those bytes end up in the
    "shortest" or the "non-shortest" form?
       When a program reads an XML file with UTF-8 encoding, is it
    possible that the bytes being decoded are in the "non-shortest form"?
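
    As a rough sketch of what question 1 is asking about, the Java
    snippet below checks a byte sequence in two ways: with a strict
    java.nio.charset.CharsetDecoder set to report malformed input, and
    with a hand-written test on the first two bytes. The class and method
    names (Utf8ShortestFormCheck, isWellFormedUtf8, isOverlongStart) are
    hypothetical, the byte ranges are my reading of the well-formed UTF-8
    table in the Unicode Standard, and the decoder result assumes a JDK
    whose UTF-8 decoder rejects the non-shortest form.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class Utf8ShortestFormCheck {

    // Returns true if a strict decoder accepts the bytes as UTF-8.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hand-written check on a lead byte b0 and the byte b1 after it
    // (both as unsigned values 0..255; b1 is assumed to be a
    // continuation byte in 0x80..0xBF).
    static boolean isOverlongStart(int b0, int b1) {
        if (b0 == 0xC0 || b0 == 0xC1) return true; // 2 bytes for a value below U+0080
        if (b0 == 0xE0 && b1 < 0xA0) return true;  // 3 bytes for a value below U+0800
        if (b0 == 0xF0 && b1 < 0x90) return true;  // 4 bytes for a value below U+10000
        return false;
    }

    public static void main(String[] args) {
        byte[] shortest = {0x41};                        // 'A' as one byte
        byte[] nonShortest = {(byte) 0xC1, (byte) 0x81}; // 'A' padded to two bytes
        System.out.println(isWellFormedUtf8(shortest));    // true
        System.out.println(isWellFormedUtf8(nonShortest)); // false
        System.out.println(isOverlongStart(0xC1, 0x81));   // true
    }
}
```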

    I hope somebody can help me understand this.
    Thanks so much.
