Encoding issue, clues needed

From: Jeroen Ruigrok van der Werven (asmodai@in-nomine.org)
Date: Sun Dec 23 2007 - 15:15:38 CST

    On my FreeBSD system I am trying to track down an encoding issue with ncurses
    and Python. After having beaten my head against it for the entire day I
    figured someone on this list would have a clue.

    Characters from the Basic Latin block are ok, but any multibyte character seems
    to get mangled in one way or another.

    For example, a character such as 的 (U+7684) gets changed to U+fffd ~Z ~D. In
    general most characters get transformed to a U+fffd + ~.. + ~.. sequence,
    where .. is a Basic Latin printable character (apparently between U+0040 -

    I am probably overlooking a very simple mangling. Even writing out
    everything in bit and hex sequences did not reveal much of a system, aside from
    the last hex digit being preserved, e.g. U+7684 still has a 4 at the end since D
    is U+0044.

    Python uses UTF-16 (UCS-2) internally and my locale, to which everything is
    decoded, uses UTF-8.

    So to take the example, U+7684 would be e7.9a.84 in UTF-8 and the sequence I
    got, aside from U+fffd, is 7e.5a.7e.44.

    To give two more examples for completeness' sake:

    居 - U+5c45 - e5.b1.85 - U+fffd ~E (7e.45)
    把 - U+628a - e6.8a.8a - U+fffd ~J~J (7e.4a.7e.4a)
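
    As a cross-check on the hex values, the UTF-8 bytes of each character can
    be computed directly rather than by hand. This is only a minimal sketch
    for a current Python 3 interpreter; the helper name utf8_hex is made up
    for illustration.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as dot-separated lowercase hex."""
    return ".".join(f"{b:02x}" for b in ch.encode("utf-8"))


# The sample characters from this message, each a single code point.
for ch in ("的", "居", "把", "í"):
    print(f"U+{ord(ch):04x} -> {utf8_hex(ch)}")
```

    Comparing these bytes with the ~.. pairs seen on screen makes it easier
    to tell which input byte produced which output character.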

    Is this some sort of signed/unsigned issue?

    Mmm, of course, ideas strike when you are about to send this...

    í - U+00ed gives me two U+fffd; in UTF-8 it would be c3.ad, both of which
    are above 7e. There's some cut-off happening, but I am not seeing in which
    direction I need to continue seeking.
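
    One guess that fits most of the output described here is that the string
    is being displayed byte by byte rather than as UTF-8: bytes in the range
    80-9f come out as "~" plus the byte minus 0x40 (a caret-style notation
    for control characters), and bytes from a0 upward come out as U+fffd. The
    sketch below only models that hypothesis in Python 3. It is not a
    confirmed diagnosis of ncurses or Python, and guess_display is a
    hypothetical helper.

```python
REPLACEMENT = "\ufffd"


def guess_display(raw):
    """Model the suspected per-byte rendering of a UTF-8 byte string."""
    out = []
    for b in raw:
        if b < 0x80:
            # ASCII passes through unchanged.
            out.append(chr(b))
        elif b < 0xA0:
            # Assumed: 80-9f is shown as "~" plus (byte - 0x40).
            out.append("~" + chr(b - 0x40))
        else:
            # Assumed: a0-ff cannot be shown and becomes U+fffd.
            out.append(REPLACEMENT)
    return out


for ch in ("的", "把", "í"):
    print(ch, guess_display(ch.encode("utf-8")))
```

    Under this model 的 gives U+fffd ~Z ~D, 把 gives U+fffd ~J ~J and í gives
    two U+fffd, as observed. For 居 it predicts two U+fffd before the ~E
    where only one was reported, so the guess may be incomplete.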

    Ideas are very much welcome!

    Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
    イェルーン ラウフロック ヴァン デル ウェルヴェン
    http://www.in-nomine.org/ | http://www.rangaku.org/
    In every stone sleeps a crystal...
