Re: Problem with accented characters

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Aug 23 2004 - 13:51:32 CDT

    William Tay wrote:

    > Can anyone explain why an accented character is sometimes represented
    > as a base character plus its accent? For example, the utf-8
    > representation for é is 65 CC 81, which is the utf-8 representation
    > for e and the accent, instead of C3 A9? I find that this is how MacOS
    > X represents accented characters.

    The two characters U+0065 (e) and U+0301 (combining acute accent),
    which together display as é, are canonically equivalent to the single
    character U+00E9 (é). That is, the two-character combining sequence is
    supposed to be considered equivalent to the single precomposed
    character. Apparently MacOS X, or at least one application running
    under it, does use the combining sequence.
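
    To see the two encodings side by side, here is a minimal decoder
    sketch in C. The function name utf8_decode is made up for
    illustration, and the sketch skips overlong-form and surrogate
    checks, so treat it as a way to inspect bytes rather than a
    production decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is truncated or malformed.
 * Assumption: overlong forms and surrogates are NOT rejected here. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t len;
    uint32_t c;

    /* The lead byte determines the sequence length and its payload bits. */
    if (b0 < 0x80)                { len = 1; c = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; c = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; c = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; c = b0 & 0x07; }
    else return 0;

    if (len > n)
        return 0;

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *cp = c;
    return len;
}
```

    Fed the bytes 65 CC 81, this yields U+0065 followed by U+0301; fed
    C3 A9, it yields the single code point U+00E9. The byte strings
    differ, which is why a plain memcmp() or strcmp() sees two different
    strings even though the text is canonically equivalent.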

    > How can a C application that receives such utf-8 encoded characters
    > handle them correctly? Appreciate your comments.

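    It must understand normalization, so that the combining sequence and
    the precomposed character are treated as the same text. See The
    Unicode Standard (TUS) 4.0, section 5.6 for more information.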
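
    As a rough illustration of what composing means, the following C
    sketch folds a base letter plus a combining mark into the precomposed
    character, working on already-decoded code points. The names
    toy_pairs and toy_compose are hypothetical, and the four-entry table
    is only a stand-in: real normalization needs the full Unicode
    composition data, canonical reordering of multiple marks, and the
    composition exclusions, so in practice an application would more
    likely call an existing library than hand-roll this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny composition table.
 * Real NFC uses the full Unicode Character Database. */
struct compose_pair { uint32_t base, mark, composed; };

static const struct compose_pair toy_pairs[] = {
    { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute     -> U+00E9 */
    { 0x0061, 0x0301, 0x00E1 },  /* a + combining acute     -> U+00E1 */
    { 0x0065, 0x0300, 0x00E8 },  /* e + combining grave     -> U+00E8 */
    { 0x0075, 0x0308, 0x00FC },  /* u + combining diaeresis -> U+00FC */
};

/* Compose adjacent base+mark pairs in place.
 * cps holds n code points; returns the new (possibly shorter) length.
 * Assumption: at most one combining mark follows each base character. */
static size_t toy_compose(uint32_t *cps, size_t n)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t c = cps[i];

        if (i + 1 < n) {
            for (size_t k = 0; k < sizeof toy_pairs / sizeof toy_pairs[0]; k++) {
                if (toy_pairs[k].base == c && toy_pairs[k].mark == cps[i + 1]) {
                    c = toy_pairs[k].composed;
                    i++;  /* the combining mark has been consumed */
                    break;
                }
            }
        }
        cps[out++] = c;
    }
    return out;
}
```

    With this, the sequence U+0065 U+0301 collapses to U+00E9, so text
    received from a sender that uses combining sequences compares equal
    to text that arrived precomposed. Normalizing both sides to the same
    form before comparing or storing is the general idea; which form to
    pick depends on the application.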

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/



    This archive was generated by hypermail 2.1.5 : Mon Aug 23 2004 - 13:52:37 CDT