Re: Perhaps OT: Mysterious escape sequences in UN data

From: John Burger (john@mitre.org)
Date: Wed Apr 01 2009 - 09:25:05 CST

    Asmus Freytag wrote:

    > It does look like most of your examples represent two-byte escapes,
    > with each byte associated with a unique character.
    > 5e = é
    > e5 = s

    Tom Gewecke wrote:

    > One possible way for that to happen: Latin-1 és is represented by
    > the bytes E9 73. Read as Big5, it becomes E973 廥. The Unicode
    > point for that character is 5EE5.

    Wow! I didn't even notice the regularity Asmus picked up on, let
    alone imagine the telephone game with encodings that Tom suggests.
    Impressive!
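
    For anyone who wants to try Tom's suggestion, here is a minimal
    sketch in Python. It assumes the text was written as Latin-1 and
    then read as Big5, which is only Tom's hypothesis, and it relies on
    Python's built-in "latin-1" and "big5" codecs. The function names
    misread and undo are made up for illustration.

```python
def misread(text: str, written_as: str = "latin-1", read_as: str = "big5") -> str:
    """Encode text in one charset, then decode the same bytes as another."""
    return text.encode(written_as).decode(read_as)


def undo(garbled: str, written_as: str = "latin-1", read_as: str = "big5") -> str:
    """Reverse misread(): re-encode with the wrong charset, decode with the right one."""
    return garbled.encode(read_as).decode(written_as)


garbled = misread("és")
# Print the code point(s) of the garbled text as hex, to compare with the escapes.
print([f"{ord(ch):04x}" for ch in garbled])
# If the hypothesis holds, undoing the misreading recovers the original text.
print(undo(garbled))
```

    If the hypothesis is right, the hex value printed for the garbled
    character should match the escape seen in the English data, and
    undo should give back the original two letters.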

    James Kass wrote:

    > It looks as though some of the data, though, is already in CJK
    > characters.

    Yes - I was a bit unclear. This data comes from documents scraped off
    the UN web site, some in Chinese, some in English. The Chinese was
    (supposedly) in one of the GB encodings, but there's no reason to think the
    English documents were. The munged-up escape codes have very
    different distributions in the two languages - the vast majority in
    the Chinese segments are just two codes, e76f and e010. These are
    both private use codepoints, right? So I'm not sure that the same
    process produced both these and the examples in the English data.
    Here are some more examples in context:

    (a)\x{e76f}大会根据《宪章》第十条具体建议减少可适用否决权的领域;
    (b)\x{e76f}现有常任理事国个别或集体地书面声明,
    69.\x{e010}在日本和其他一些国家共同采取的主动行动的影响下,
    70.\x{e010}关于南-南三角合作,日本承认最近几年已取得相当的进展。

    As I said, though, these seem to simply be spacing characters of some
    sort. I think I can come up with a 90% solution for this mess -
    thanks for everyone's help!
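
    A rough sketch of the Chinese-side cleanup, in Python: it expands
    the \x{....} escapes, checks whether each character is in a private
    use area (Unicode general category "Co"), and replaces those with a
    plain space. Treating them as spaces is only my guess from the
    contexts above, and the helper names are made up for illustration.

```python
import re
import unicodedata

# Matches the \x{hhhh} escape notation used in the scraped data.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")


def unescape(line: str) -> str:
    """Turn each \\x{hhhh} escape into the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), line)


def is_private_use(ch: str) -> bool:
    """True for private use code points (general category 'Co')."""
    return unicodedata.category(ch) == "Co"


def clean(line: str) -> str:
    """Expand escapes, then replace private use characters with a space (an assumption)."""
    return "".join(" " if is_private_use(ch) else ch for ch in unescape(line))


sample = r"(b)\x{e76f}现有常任理事国个别或集体地书面声明,"
print(clean(sample))
```

    Both e76f and e010 fall in the range U+E000 to U+F8FF, so
    is_private_use reports True for them and clean leaves the
    surrounding Chinese text untouched.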

    - John Burger
       MITRE


