[Fwd: [question] UTF-8 issue--update]

From: Chat S. Depasucat (cdepasucat@ntsp.nec.co.jp)
Date: Thu Oct 08 2009 - 05:56:49 CDT

    Update:

    The Java program reads the XML file with InputStreamReader, which in
    turn uses StreamDecoder.

    Is this safe enough that it will not produce "non-shortest form"
    sequences? Or do I still have something to worry about?

    Thanks a lot.
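
    To make the InputStreamReader question concrete, here is a minimal
    sketch of one way to check it, assuming Java 7 or later (for
    StandardCharsets and try-with-resources). The class name
    StrictUtf8Reader and the helper isWellFormed are hypothetical names
    used only for illustration. The sketch hands InputStreamReader a
    CharsetDecoder set to REPORT malformed input; a reader built from just
    a charset name replaces malformed input with U+FFFD instead of
    raising an error.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictUtf8Reader {

    // Returns true if the bytes decode as UTF-8 with no malformed input.
    static boolean isWellFormed(byte[] bytes) {
        // REPORT makes the decoder signal an error instead of silently
        // substituting U+FFFD for a malformed sequence.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), strict)) {
            while (reader.read() != -1) {
                // Drain the stream; only the absence of errors matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            // Raised by the strict decoder on malformed input.
            return false;
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] shortest = {0x2F};                        // '/' in shortest form
        byte[] overlong = {(byte) 0xC0, (byte) 0xAF};    // '/' in non-shortest form
        System.out.println(isWellFormed(shortest));
        System.out.println(isWellFormed(overlong));
    }
}
```

    In an application the ByteArrayInputStream would be replaced by the
    stream for the XML file; the in-memory stream is only there to keep
    the sketch self-contained.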


    attached mail follows:



    I'm really thankful to have found this mailing list.

    I have a few UTF-8 issues that I hope somebody can shed some light on:

    I understand that in UTF-8 encoding, Unicode characters can be
    represented in more than one way.
    For example, a US-ASCII character can be represented in "shortest
    form" or in "non-shortest form".
    Because of this, Java 1.6.0_11 changed its UTF-8 charset implementation
    to reject the "non-shortest form".

    Here are my questions:
    1. How does a UTF-8 decoder identify that a byte sequence is illegal,
    that is, that the sequence is in non-shortest form?
    2. Who or what produces the "non-shortest form" during encoding?
        For XML files, for example, what transforms the characters into
    bytes, which could then come out in "shortest" or "non-shortest" form?

       When a program reads an XML file in UTF-8 encoding, is it
    possible that the bytes being decoded are in "non-shortest form"?
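
    As an illustration of question 1, the sketch below shows one way such
    a check could be written by hand; it is not taken from any particular
    decoder. The class name OverlongCheck and the method
    isNonShortestForm are hypothetical. The idea is to read the length
    from the lead byte, assemble the code point from the payload bits,
    and compare it with the smallest code point that genuinely needs
    that many bytes (U+0080 for two bytes, U+0800 for three, U+10000 for
    four).

```java
// Hypothetical helper class, for illustration only.
class OverlongCheck {

    // Smallest code point that needs 1, 2, 3, or 4 bytes in UTF-8.
    private static final int[] MIN_CODE_POINT = {0x0, 0x80, 0x800, 0x10000};

    // Returns true if the first sequence in seq has well-formed lead and
    // continuation bit patterns but encodes a code point that would fit
    // in fewer bytes. Truncated or otherwise invalid input returns false,
    // because this sketch checks only for the non-shortest form.
    static boolean isNonShortestForm(byte[] seq) {
        if (seq.length == 0) {
            return false;
        }
        int lead = seq[0] & 0xFF;
        int length;
        int codePoint;
        if (lead < 0x80) {
            return false;                       // one byte is always shortest
        } else if ((lead & 0xE0) == 0xC0) {     // 110xxxxx: two-byte sequence
            length = 2;
            codePoint = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {     // 1110xxxx: three-byte sequence
            length = 3;
            codePoint = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {     // 11110xxx: four-byte sequence
            length = 4;
            codePoint = lead & 0x07;
        } else {
            return false;                       // not a lead byte
        }
        if (seq.length < length) {
            return false;                       // truncated sequence
        }
        for (int i = 1; i < length; i++) {
            int b = seq[i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return false;                   // not a 10xxxxxx continuation byte
            }
            codePoint = (codePoint << 6) | (b & 0x3F);
        }
        return codePoint < MIN_CODE_POINT[length - 1];
    }

    public static void main(String[] args) {
        byte[] slash = {0x2F};                          // U+002F, shortest form
        byte[] overlongSlash = {(byte) 0xC0, (byte) 0xAF}; // U+002F padded to two bytes
        System.out.println(isNonShortestForm(slash));
        System.out.println(isNonShortestForm(overlongSlash));
    }
}
```

    On question 2, a conforming encoder such as Java's
    String.getBytes with the UTF-8 charset always writes the shortest
    form, so non-shortest sequences would have to come from a
    non-conforming encoder or from bytes that were crafted deliberately.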

    I hope somebody can help me understand this.
    Thanks so much.


