Re: Roman numerals in non-latin text

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jun 12 2003 - 15:09:36 EDT

    Pim Blokland <pblokland@planet.nl> wrote:
    > No. Encoded like that it may *look* like a roman three, but two of
    > those are definitely not correct. Only U+2162 or its compatibility
    > decomposition, U+0049 U+0049 U+0049 should be used. The other two
    > are bad coding, just as using greek Iotas or combinations of U+2160
    > and U+0049 would be.
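
    For reference, the compatibility decomposition mentioned in the quote
    can be checked with Python's standard unicodedata module. This is only
    a minimal sketch; the helper name compat_fold is made up for the
    illustration and is not part of any library.

```python
import unicodedata

ROMAN_THREE = "\u2162"            # ROMAN NUMERAL THREE
LATIN_III = "\u0049\u0049\u0049"  # three LATIN CAPITAL LETTER I
MIXED_II = "\u2160\u0049"         # ROMAN NUMERAL ONE + LATIN CAPITAL LETTER I


def compat_fold(text: str) -> str:
    """Hypothetical helper: normalize to NFKC (compatibility folding)."""
    return unicodedata.normalize("NFKC", text)


print(unicodedata.decomposition(ROMAN_THREE))  # <compat> 0049 0049 0049
print(compat_fold(ROMAN_THREE) == LATIN_III)   # True
print(compat_fold(MIXED_II))                   # II
```

    Note that the mixed U+2160 U+0049 sequence also folds to plain Latin
    letters under NFKC, so a normalizing process cannot tell it apart
    from the recommended encodings afterwards.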

    It may happen when the text was initially encoded with a legacy
    encoding, then converted to Unicode.

    With legacy encodings and input methods, users tend to type the
    characters they have on their keyboard, and will not use the
    complicated keystrokes needed to enter Latin letters when the
    supported encoding has no support for Roman numerals.

    So you'll find Roman numerals encoded with Greek letters in many
    Greek texts, or with Cyrillic letters in Russian text...

    That's not uncommon, and in these legacy encodings such substitutions
    were really treated as a kind of compatibility decomposition, even
    though they do not appear in the Unicode decompositions.
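
    To illustrate that last point, here is a small sketch (again with
    Python's unicodedata module) showing that Greek or Cyrillic
    lookalikes are left untouched by NFKC. The sample strings and the
    helper names are invented for the example.

```python
import unicodedata

GREEK_III = "\u0399\u0399\u0399"  # three GREEK CAPITAL LETTER IOTA
CYRILLIC_X = "\u0425"             # CYRILLIC CAPITAL LETTER HA


def nfkc(text: str) -> str:
    """Normalize to NFKC, the strongest standard compatibility folding."""
    return unicodedata.normalize("NFKC", text)


def names(text: str) -> list[str]:
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


# The lookalikes survive normalization unchanged: no decomposition
# links them to the Latin letters I and X.
print(nfkc(GREEK_III) == "III")   # False
print(nfkc(CYRILLIC_X) == "X")    # False
print(names(CYRILLIC_X))          # ['CYRILLIC CAPITAL LETTER HA']
```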

    In fact, most Latin, Greek and Cyrillic characters have a common
    origin, and the three scripts inherited the same glyph designs and
    many common uses from one another. Unicode did not attempt to unify
    them, even if in theory it could have been done. This was a compromise:
    the legacy encodings often include both Latin and Greek letters, or
    both Latin and Cyrillic letters, and did not unify them either. The
    choice was to preserve bijective (round-trip) compatibility with those
    widely used encodings, and to maintain the distinction, because the
    characters also imply language differences and normally different
    contexts, which a unification in Unicode would have lost.

    These missing unifications are noted in the character charts,
    but are not present in the official compatibility decompositions.

    However, a unification remains possible later, if the text carries
    an indication of the language used: this gives the restricted set of
    characters used in that language, and the most widely used legacy
    encodings in which such historic uses are common.
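
    Such a language-guided unification could be sketched as follows.
    Everything here is an assumption made for illustration: the
    LOOKALIKES tables are tiny and hypothetical (they are not Unicode
    data), the language tags are supposed to come from the document
    metadata, and fold_roman is an invented helper name.

```python
import re

# Hypothetical per-language lookalike tables (NOT official Unicode data).
LOOKALIKES = {
    "el": {"\u0399": "I", "\u03A7": "X", "\u039C": "M"},  # Iota, Chi, Mu
    "ru": {"\u0425": "X", "\u0421": "C", "\u041C": "M"},  # Ha, Es, Em
}

# Classic pattern for well-formed uppercase Roman numerals (1..3999).
ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def fold_roman(token: str, lang: str) -> str:
    """Map lookalike letters to Latin if the result is a Roman numeral.

    The token is returned unchanged when the language has no table,
    when nothing was substituted, or when the folded candidate is not
    a well-formed Roman numeral.
    """
    table = LOOKALIKES.get(lang, {})
    candidate = "".join(table.get(ch, ch) for ch in token)
    if candidate and candidate != token and ROMAN.fullmatch(candidate):
        return candidate
    return token


print(fold_roman("\u03A7\u0399\u0399", "el"))  # XII
print(fold_roman("\u0425\u0425", "ru"))        # XX
```

    This is only a heuristic: a short word or abbreviation made solely
    of lookalike letters would be folded too, so a real converter would
    also need to look at the surrounding context.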


