From: John Burger (firstname.lastname@example.org)
Date: Wed Apr 01 2009 - 09:25:05 CST
Asmus Freytag wrote:
> It does look like most of your examples represent two-byte escapes
> each byte associated with a unique character.
> 5e = é
> e5 = s
Tom Gewecke wrote:
> One possible way for that to happen: Latin-1 és is represented by
> the bytes E9 73. Read as Big5, it becomes E973 廥. The Unicode
> point for that character is 5EE5.
Wow! I didn't even notice the regularity Asmus picked up on, let
alone imagine the telephone game with encodings that Tom suggests.
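
For anyone who wants to replay the chain Tom describes, here is a minimal Python sketch. It assumes Python's built-in "latin-1" and "big5" codecs behave like whatever converter actually touched the data, which is a guess. The helper name undo_big5_misread is hypothetical, not something from the original pipeline.

```python
# Replay the suspected mix-up: Latin-1 bytes read with a Big5 decoder.
latin1_bytes = "és".encode("latin-1")   # the two bytes E9 73
misread = latin1_bytes.decode("big5")   # wrong decoder: one CJK character
print(latin1_bytes.hex(), misread, f"U+{ord(misread):04X}")


def undo_big5_misread(text):
    """Hypothetical repair: re-encode as Big5, then decode as Latin-1.

    Returns the text unchanged if it cannot be encoded as Big5.
    """
    try:
        return text.encode("big5").decode("latin-1")
    except UnicodeError:
        return text


print(undo_big5_misread(misread))
```

The repair only makes sense for strings known to have gone through this particular misreading; applied to genuine Chinese text it would turn good characters into Latin-1 noise.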
James Kass wrote:
> It looks as though some of the data, though, is already in CJK
Yes - I was a bit unclear. This data comes from documents scraped off
the UN web site, some in Chinese, some in English. The Chinese was
(supposedly) in one of the GBs, but there's no reason to think the
English documents were. The munged-up escape codes have very
different distributions in the two languages - the vast majority in
the Chinese segments are just two codes, e76f and e010. These are
both private use codepoints, right? So I'm not sure the same process
produced these as produced the examples in the English data. Here are
some more examples in context:
As I said, though, these seem to simply be spacing characters of some
sort. I think I can come up with a 90% solution for this mess -
thanks for everyone's help!
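
One way to check the private-use question and to compare distributions between the two languages is sketched below. It relies on the Unicode general category "Co" (private use) as reported by Python's unicodedata module. The function names and the sample string are made up for illustration and are not taken from the UN data.

```python
import unicodedata
from collections import Counter


def is_private_use(cp):
    """True if the code point has Unicode general category Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


def private_use_histogram(text):
    """Count each private-use character in text, keyed by its hex code."""
    return Counter(f"{ord(ch):04x}" for ch in text if is_private_use(ord(ch)))


for cp in (0xE76F, 0xE010, 0x5EE5):
    print(f"U+{cp:04X}", is_private_use(cp))

# Illustrative sample only: two private-use characters between ASCII letters.
sample = "a\ue76fb\ue010c\ue76f"
print(private_use_histogram(sample))
```

Running the histogram separately over the Chinese and the English segments would show whether the two really are dominated by different codes.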
- John Burger