Re: Perhaps OT: Mysterious escape sequences in UN data

From: John Burger
Date: Wed Apr 01 2009 - 09:25:05 CST


    Asmus Freytag wrote:

    > It does look like most of your examples represent two-byte escapes,
    > with each byte associated with a unique character.
    > 5e = é
    > e5 = s

    Tom Gewecke wrote:

    > One possible way for that to happen: Latin-1 és is represented by
    > the bytes E9 73. Read as Big5, it becomes E973 廥. The Unicode
    > point for that character is 5EE5.

    Wow! I didn't even notice the regularity Asmus picked up on, let
    alone imagine the telephone game with encodings that Tom suggests.
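
    For anyone who wants to replay Tom's suggestion, here is a minimal
    sketch using Python's built-in "latin-1" and "big5" codecs. The
    function names are made up for illustration, and this only
    demonstrates the one mechanism Tom describes; it is not known to be
    what actually happened to the UN data.

```python
def misread_as_big5(text: str) -> str:
    """Encode text as Latin-1, then (wrongly) decode the bytes as Big5."""
    raw = text.encode("latin-1")      # "és" becomes the bytes E9 73
    return raw.decode("big5")         # the byte pair is read as one Big5 character


def undo_misread(garbled: str) -> str:
    """Reverse the misreading: back to Big5 bytes, then decode as Latin-1."""
    return garbled.encode("big5").decode("latin-1")


garbled = misread_as_big5("és")
# Tom's description predicts a single CJK character at U+5EE5 here.
print(garbled, [f"U+{ord(ch):04X}" for ch in garbled])
print(undo_misread(garbled))
```

    If the escapes in the English data really are Unicode code points
    produced this way, running each one back through something like
    undo_misread should yield a plausible pair of Latin-1 characters.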

    James Kass wrote:

    > It looks as though some of the data, though, is already in CJK
    > characters.

    Yes - I was a bit unclear. This data comes from documents scraped off
    the UN web site, some in Chinese, some in English. The Chinese was
    (supposedly) in one of the GB encodings, but there's no reason to
    think the English documents were. The munged-up escape codes have
    very different distributions in the two languages - the vast majority
    in the Chinese segments are just two codes, e76f and e010. These are
    both private use codepoints, right? So I'm not sure these were
    produced by the same process as the examples in the English data.
    Here are some more examples in context:


    As I said, though, these seem to simply be spacing characters of some
    sort. I think I can come up with a 90% solution for this mess -
    thanks for everyone's help!
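
    On the private-use question: a quick way to check is the Unicode
    general category, where "Co" means private use. This is a small
    sketch with Python's standard unicodedata module; is_private_use is
    just an illustrative helper name. It says nothing about which
    vendor or font assigned a meaning to those code points.

```python
import unicodedata


def is_private_use(code_point: int) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


# The two codes that dominate the Chinese segments, plus 5EE5 for contrast.
for cp in (0xE76F, 0xE010, 0x5EE5):
    print(f"U+{cp:04X}", "private use" if is_private_use(cp) else "assigned")
```

    Both e76f and e010 fall in the BMP Private Use Area (U+E000 to
    U+F8FF), while 5EE5 is an ordinary CJK ideograph.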

    - John Burger
