Re: Perhaps OT: Mysterious escape sequences in UN data

From: Peter Zilahy Ingerman, PhD (pzi@ingerman.org)
Date: Tue Mar 31 2009 - 16:36:05 CST

    Well, FWIW they aren't the codes used for these characters in
    WordPerfect 6.0 running under DOS.

    Peter Ingerman

    Asmus Freytag wrote:

    > It does look like most of your examples represent two-byte escapes
    > with each byte associated with a unique character.
    > 5e = é
    > e5 = s
    > 66 = m
    > e7 = p
    > 74 = í (i with accent)
    > b2 = g
    >
    > I have no suggestion that would explain the values, but they seem to
    > be consistent, so it should be possible to find a proper context for
    > each byte, and deal with combinations as derived from combinations of
    > byte values (i.e., as code sequences) rather than treating them as
    > ligatures.
    >
    > A./
    >
    > On 3/31/2009 12:58 PM, John Burger wrote:
    >
    >> Hi -
    >>
    >> I have some parallel Chinese-English UN proceedings scraped from the
    >> UN website some years ago, and further processed by the Linguistic
    >> Data Consortium. I think the data were originally in one of the GB
    >> variants, in MS Word or WordPerfect.
    >>
    >> The data is littered with some odd escape sequences, in both
    >> languages, like this:
    >>
    >> ... Permanent Representatives and Charg\x{5ee5} daffaires of
    >> Kuwait, Burundi ...
    >> -\x{e76f}现?常任?事国 ...
    >>
    >> According to the LDC README, the "\x{}" is their way of escaping
    >> WordPerfect encodings that they could not convert.
    >>
    >> I can guess what some of these are - e76f seems to occur in
    >> contexts that indicate it's some kind of spacing character, perhaps a
    >> tab. Oddly, most of the rest seem to represent =two= characters.
    >> For instance 5ee5 seems to be "és":
    >>
    >> misleading clich\x{5ee5} that
    >> Mr. Andr\x{5ee5} Pastrana Arango
    >>
    >> Here's some others:
    >>
    >> highlighted by Mr. Rodr\x{74b2}uez
    >> issued by the Espace r\x{5ee7}ublicain
    >> transmitting an aide-m\x{5e66}oire issued
    >>
    >> These seem like odd choices for ligatures. I can correct some of
    >> these, but there are hundreds of different ones. Sorry if I'm
    >> providing insufficient information, but can anyone shed any light on
    >> this?
    >>
    >> Thanks!
    >>
    >> - John D. Burger
    >> MITRE
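The per-byte reading suggested in the quoted reply (each \x{....} escape holds two bytes, each byte standing for one character) can be tried out with a short script. This is only a sketch under assumptions: the table holds just the six byte values listed in the thread, the names BYTE_TO_CHAR, decode_escapes and unknown_escapes are made up for illustration, and it assumes a byte means the same thing in either position, which the thread does not confirm. Escapes containing any unlisted byte (such as e76f) are left untouched.

```python
import re
from collections import Counter

# Byte values and characters exactly as listed in the thread.
# Everything else is unknown and deliberately not guessed at.
BYTE_TO_CHAR = {
    0x5E: "é",
    0xE5: "s",
    0x66: "m",
    0xE7: "p",
    0x74: "í",
    0xB2: "g",
}

# Matches the LDC-style escape \x{hhhh} with exactly four hex digits.
ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]{4})\}")


def split_bytes(hex_digits):
    """Split a four-digit hex string into (high byte, low byte)."""
    value = int(hex_digits, 16)
    return value >> 8, value & 0xFF


def decode_escapes(text):
    """Replace each escape whose two bytes are both in the table."""
    def replace(match):
        high, low = split_bytes(match.group(1))
        if high in BYTE_TO_CHAR and low in BYTE_TO_CHAR:
            return BYTE_TO_CHAR[high] + BYTE_TO_CHAR[low]
        return match.group(0)  # unknown byte: keep the escape as-is
    return ESCAPE.sub(replace, text)


def unknown_escapes(text):
    """Count escapes that the table cannot decode yet."""
    counts = Counter()
    for match in ESCAPE.finditer(text):
        high, low = split_bytes(match.group(1))
        if high not in BYTE_TO_CHAR or low not in BYTE_TO_CHAR:
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    print(decode_escapes(r"highlighted by Mr. Rodr\x{74b2}uez"))
    print(decode_escapes(r"transmitting an aide-m\x{5e66}oire issued"))
    print(unknown_escapes(r"-\x{e76f} and clich\x{5ee5}"))
```

Running unknown_escapes over the whole corpus would give a frequency list of the escapes still to be identified, which is one way to extend the table byte by byte from context rather than correcting each two-character combination separately.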



    This archive was generated by hypermail 2.1.5 : Tue Mar 31 2009 - 16:38:35 CST