From: John Burger (john@mitre.org)
Date: Wed Apr 01 2009 - 09:25:05 CST
Asmus Freytag wrote:
> It does look like most of your examples represent two-byte escapes with
> each byte associated with a unique character.
> 5e = é
> e5 = s
Tom Gewecke wrote:
> One possible way for that to happen: Latin-1 és is represented by
> the bytes E9 73. Read as Big5, it becomes E973 廥. The Unicode
> point for that character is 5EE5.
Wow! I didn't even notice the regularity Asmus picked up on, let
alone imagine the telephone game with encodings that Tom suggests.
Impressive!
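
For anyone who wants to replay the chain Tom describes, here is a minimal sketch in Python. It assumes the original text really was Latin-1 bytes that got decoded as Big5, and that Python's built-in "big5" codec is close enough to whatever Big5 variant was involved. The function names are made up for this illustration.

```python
def misread_latin1_as_big5(text: str) -> str:
    """Encode text as Latin-1, then decode the very same bytes as Big5."""
    return text.encode("latin-1").decode("big5")


def undo_big5_misreading(garbled: str) -> str:
    """Reverse the step above: back to bytes via Big5, then read as Latin-1."""
    return garbled.encode("big5").decode("latin-1")


garbled = misread_latin1_as_big5("és")  # Latin-1 bytes E9 73
for ch in garbled:
    # Tom's message gives 5EE5 as the code point for this pair of bytes.
    print(f"U+{ord(ch):04X}")
print(undo_big5_misreading(garbled))
```

The reverse step only works for byte pairs that happen to form a valid Big5 character, so it would need error handling before being run over real data.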
James Kass wrote:
> It looks as though some of the data, though, is already in CJK
> characters.
Yes - I was a bit unclear. This data comes from documents scraped off
the UN web site, some in Chinese, some in English. The Chinese was
(supposedly) in one of the GB encodings, but there's no reason to think
the English documents were. The munged-up escape codes have very
different distributions in the two languages - the vast majority in
the Chinese segments are just two codes, e76f and e010. These are
both private use codepoints, right? So I'm not sure these came from the
same process as the examples in the English data. Here are some more
examples in context:
(a)\x{e76f}大会根据《宪章》第十条具体建议减少可适用否决权的领域;
(b)\x{e76f}现有常任理事国个别或集体地书面声明,
69.\x{e010}在日本和其他一些国家共同采取的主动行动的影响下,
70.\x{e010}关于南-南三角合作,日本承认最近几年已取得相当的进展。
As I said, though, these seem to simply be spacing characters of some
sort. I think I can come up with a 90% solution for this mess -
thanks for everyone's help!
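
A rough sketch of one possible cleanup for the Chinese segments, in Python. It assumes the escapes stand for the actual characters U+E76F and U+E010, and that they really are nothing more than spacing. Both code points fall in the BMP Private Use Area (U+E000 to U+F8FF), which the standard unicodedata module reports as general category Co. The helper names are hypothetical.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character's general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"


def replace_private_use(text: str, replacement: str = " ") -> str:
    """Swap every private-use character for the replacement string."""
    return "".join(replacement if is_private_use(ch) else ch for ch in text)


# One of the examples above, with the escape written as a real character.
sample = "(b)\ue76f现有常任理事国个别或集体地书面声明,"
print(is_private_use("\ue76f"), is_private_use("\ue010"))
print(replace_private_use(sample))
```

This treats every private-use character the same way, so it would also flatten any that carry real meaning in some vendor-specific font or encoding.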
- John Burger
MITRE