Re: Roman numerals in non-latin text

From: Philippe Verdy ([email protected])
Date: Thu Jun 12 2003 - 15:09:36 EDT

Next message: Asmus Freytag: "Updated: Unicode TR#20 "Unicode in XML""

Previous message: Jim Allan: "Re: Caron / Hacek?"
In reply to: Pim Blokland: "Re: Roman numerals in non-latin text"

Pim Blokland <[email protected]> wrote:
> No. Encoded like that it may *look* like a roman three, but two of
> those are definitely not correct. Only U+2162 or its compatibility
> decomposition, U+0049 U+0049 U+0049 should be used. The other two
> are bad coding, just as using greek Iotas or combinations of U+2160
> and U+0049 would be.

It may happen when the text was initially encoded with a legacy
encoding, then converted to Unicode.

With legacy encodings and input methods, users tend to type the
characters they have on their keyboard, and will not use the
complicated keystrokes needed to enter Latin letters when the
encoding in use has no support for Roman numerals.

So you'll find Roman numerals encoded with Greek letters in many
Greek texts, or with Cyrillic letters in Russian text...
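
As a minimal sketch of how such text ends up in Unicode, assume the
legacy data was stored in ISO 8859-7 and that the user typed the Greek
Iota key three times (the byte string and variable names here are only
illustrative). A converter maps each byte to its own code point, so the
Greek letters stay Greek:

```python
# Illustrative only: a "Roman three" typed with the Greek Iota key
# and stored in ISO 8859-7, where capital Iota is the byte 0xC9.
legacy_bytes = b"\xc9\xc9\xc9"

# Converting to Unicode maps each byte to its own code point;
# nothing in the conversion knows these letters were meant as a numeral.
converted = legacy_bytes.decode("iso8859_7")

code_points = [f"U+{ord(ch):04X}" for ch in converted]
print(code_points)
```

The result looks like a Roman three but consists of three
U+0399 characters, not U+0049 or U+2162.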

That's not uncommon, and in these legacy encodings such substitutions
were really treated as compatibility decompositions, even though they
do not appear among the Unicode decompositions.
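
The difference can be checked with Python's standard unicodedata
module. This is only a sketch; the helper name nfkc is my own. U+2162
carries a compatibility mapping to three Latin capital I's, while the
Greek and Cyrillic look-alikes carry no mapping at all, so
normalization leaves them untouched:

```python
import unicodedata

def nfkc(text):
    """Apply compatibility normalization (NFKC) to text."""
    return unicodedata.normalize("NFKC", text)

# U+2162, Greek capital Iota, Cyrillic capital Byelorussian-Ukrainian I
for ch in ("\u2162", "\u0399", "\u0406"):
    # decomposition() returns an empty string when no mapping exists
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")

# Only the dedicated Roman numeral folds to Latin letters.
print(nfkc("\u2162") == "III")
print(nfkc("\u0399\u0399\u0399") == "III")
```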

In fact, most Latin, Greek and Cyrillic characters have a common
origin, and have inherited the same glyph designs and many common
uses from one another. Unicode did not attempt to unify them,
even though theoretically it could have been done. This was a
compromise, as the legacy encodings often include both Latin and
Greek letters, or both Latin and Cyrillic letters, and did not unify
them either. The choice was to preserve bijective (round-trip)
compatibility with those widely used encodings, and to maintain the
distinction, as the characters also imply language differences and
normally different contexts, which a unification in Unicode would
have lost.

These missing unifications are noted in the character charts,
but are not present in the official compatibility decompositions.

However, a unification is still possible later, if the text carries
indications of the language used: these determine the restricted
set of characters used in that language and the most widely used
legacy encodings in which such historic uses are common.
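
A rough Python sketch of such a language-driven unification follows.
Everything in it is an assumption made for illustration: the
LOOKALIKES tables, the language tags, and the function name
fold_roman_lookalikes are hypothetical, not taken from Unicode data.
A word is folded to Latin only if it contains a look-alike letter and
the result is a well-formed Roman numeral:

```python
import re

# Hypothetical per-language tables: letters commonly typed in place of
# the Latin Roman-numeral letters. Not derived from any Unicode data file.
LOOKALIKES = {
    "el": {"\u0399": "I", "\u03A7": "X", "\u039C": "M"},
    "ru": {"\u0406": "I", "\u0425": "X", "\u0421": "C", "\u041C": "M"},
}

# Classical Roman numeral syntax (1 to 3999) in Latin capital letters.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def fold_roman_lookalikes(text, lang):
    """Replace look-alike Roman numerals with Latin letters for one language."""
    table = LOOKALIKES.get(lang)
    if not table:
        return text  # no table for this language: leave the text alone

    def fold(match):
        word = match.group(0)
        folded = "".join(table.get(ch, ch) for ch in word)
        # Fold only if something changed and the result is a valid numeral.
        if folded != word and ROMAN.fullmatch(folded):
            return folded
        return word

    return re.sub(r"\w+", fold, text)

print(fold_roman_lookalikes("\u0399\u0399\u0399", "el"))
```

This remains a heuristic: a short all-capitals word or abbreviation
made only of look-alike letters would also be folded, so real use
would need more context than the language tag alone.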

This archive was generated by hypermail 2.1.5 : Thu Jun 12 2003 - 15:53:11 EDT