Re: Converting EBCDIC to Unicode

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Feb 12 2003 - 02:47:37 EST

    Markus Scherer <markus dot scherer at jtcsv dot com> wrote:

    >> They are all the same in the A-Z, a-z, and 0-9
    >> ranges, but beyond that they can differ substantially.
    >
    > There are some more characters that have the same codes in most EBCDIC
    > codepages, but there are also some where the Latin letters are not all
    > present. (I think some old Japanese EBCDIC codepages replace small
    > Latin letters with Katakana ones.)

    Indeed, I was oversimplifying things a bit. There are other invariant,
    or almost-invariant, EBCDIC characters. For example, SPACE had better
    be invariant or there will be serious problems!

    In 1997 I did a quick study of all the EBCDIC code pages on the DKUUG
    FTP site (about 25 of them, as I recall) and made a list of the
    characters that were at the same position in every single page:

    0x40 SPACE
    0x4B .
    0x4D (
    0x4E +
    0x5C *
    0x5D )
    0x5E ;
    0x60 -
    0x61 /
    0x6B ,
    0x6D _
    0x6E >
    0x6F ?
    0x7A :
    0x7D '
    0x7E =

    as well as the letters and numbers already mentioned:

    0x81-0x89 a-i
    0x91-0x99 j-r
    0xA2-0xA9 s-z
    0xC1-0xC9 A-I
    0xD1-0xD9 J-R
    0xE2-0xE9 S-Z
    0xF0-0xF9 0-9
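
    The two lists above can be turned into a small lookup table. What
    follows is only a sketch: the names (INVARIANT, decode_invariant,
    mismatches) are made up for illustration, and the check runs against
    the EBCDIC codecs that ship with Python (cp037, cp500), which are an
    assumption here and not the DKUUG set that was surveyed.

```python
# Invariant EBCDIC positions, copied from the lists in the message.
PUNCTUATION = {
    0x40: " ", 0x4B: ".", 0x4D: "(", 0x4E: "+", 0x5C: "*", 0x5D: ")",
    0x5E: ";", 0x60: "-", 0x61: "/", 0x6B: ",", 0x6D: "_", 0x6E: ">",
    0x6F: "?", 0x7A: ":", 0x7D: "'", 0x7E: "=",
}

# Each entry is (first byte, first character, number of characters).
RANGES = [
    (0x81, "a", 9), (0x91, "j", 9), (0xA2, "s", 8),
    (0xC1, "A", 9), (0xD1, "J", 9), (0xE2, "S", 8),
    (0xF0, "0", 10),
]


def build_invariant_table():
    """Expand the punctuation and ranges into one byte -> character dict."""
    table = dict(PUNCTUATION)
    for first_byte, first_char, count in RANGES:
        for offset in range(count):
            table[first_byte + offset] = chr(ord(first_char) + offset)
    return table


INVARIANT = build_invariant_table()


def decode_invariant(data, unknown="\ufffd"):
    """Decode only the listed positions; any other byte becomes `unknown`."""
    return "".join(INVARIANT.get(byte, unknown) for byte in data)


def mismatches(codec_name):
    """Return listed positions where a Python codec disagrees with the table."""
    return {
        byte: char
        for byte, char in INVARIANT.items()
        if bytes([byte]).decode(codec_name) != char
    }


if __name__ == "__main__":
    print(decode_invariant(bytes([0xC8, 0x85, 0x93, 0x93, 0x96])))
    for name in ("cp037", "cp500"):
        print(name, mismatches(name))
```

    Bytes outside the table (the ampersand at 0x50, for instance) come
    back as U+FFFD, so the caller can tell that a real code page is
    needed to decode them.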

    There were some other characters that were the same in ALMOST all code
    pages, such as the ampersand at 0x50. I think it was some kind of Greek
    EBCDIC page that put a different character at 0x50. Amusingly, the
    greater-than sign is constant at 0x6E, but the less-than sign (though
    always present) is not on the list because it floats among different
    character positions.
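
    One way to see whether a character "floats" is to ask several code
    pages where they put it. The sketch below is hedged in the same way:
    positions and is_invariant are hypothetical helper names, and the
    codec names are Python's bundled EBCDIC codecs, not the pages from
    the 1997 survey, so the results need not match that survey.

```python
def positions(char, codec_names):
    """Map each codec name to the single byte it uses for `char`.

    The value is None when the codec cannot encode the character or
    does not encode it as exactly one byte.
    """
    result = {}
    for name in codec_names:
        try:
            encoded = char.encode(name)
        except UnicodeEncodeError:
            result[name] = None
            continue
        result[name] = encoded[0] if len(encoded) == 1 else None
    return result


def is_invariant(char, codec_names):
    """True if every codec places `char` at the same single byte."""
    found = set(positions(char, codec_names).values())
    return len(found) == 1 and None not in found


if __name__ == "__main__":
    pages = ["cp037", "cp500"]  # assumption: Python's bundled codecs
    for ch in (">", "&", "!"):
        where = {
            name: (None if byte is None else hex(byte))
            for name, byte in positions(ch, pages).items()
        }
        print(repr(ch), where, "invariant" if is_invariant(ch, pages) else "floats")
```

    With only these two pages, the greater-than sign stays at 0x6E while
    the exclamation mark moves between 0x5A and 0x4F; a wider set of
    pages would be needed to reproduce the observation about the
    less-than sign.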

    The DKUUG site may not have included the Katakana code page that Markus
    mentioned, although such a thing is described extensively in Chapter 18
    of Mackenzie. Doubtless there are other versions of EBCDIC that assign
    different characters to even these "invariant" code positions. Putting
    an end to this kind of thing is one of the reasons we love Unicode.

    -Doug Ewell
     Fullerton, California


