Re: Code Pages!

From: Markus Scherer (
Date: Thu Jul 24 2003 - 14:18:43 EDT

    There are many codepages for Indic languages.

    Modern systems support Unicode. It is what Windows, Mac OS X, Java, modern web browsers, etc.
    use internally. Everything else is supported via conversion, which can be problematic.
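    To illustrate why conversion can be problematic, here is a minimal Python sketch. It uses two
    common Western codepages rather than an Indic one, only because Python ships codecs for them;
    the helper name to_legacy is made up for this example. The same bytes decode to different
    text depending on which codepage is assumed, and not all Unicode text can be converted back.

```python
# The same bytes mean different things under different assumed codepages.
raw = "€5".encode("cp1252")            # the euro sign is byte 0x80 in windows-1252

as_cp1252 = raw.decode("cp1252")       # the euro sign followed by "5"
as_latin1 = raw.decode("latin-1")      # a C1 control character followed by "5"


def to_legacy(text, codepage):
    """Hypothetical helper: encode text, or return None if the codepage
    has no representation for some character in it."""
    try:
        return text.encode(codepage)
    except UnicodeEncodeError:
        return None


# Converting from Unicode to a legacy codepage can simply fail.
euro_in_latin1 = to_legacy("€", "latin-1")     # None: no euro sign in ISO 8859-1
ka_in_cp1252 = to_legacy("\u0915", "cp1252")   # None: no Devanagari in windows-1252
```

    Keeping text in Unicode end to end avoids both failure modes.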

    The ISCII standard is byte-based and stateful. (Complicated and not widely supported.) It has switch
    commands to go between the Indic scripts, and it also has commands for fancy-text attributes like
    "bold". The latter cannot be handled in plain-text, general-purpose codepage conversion, of course.

    When there are multiple names or codepage numbers for ISCII, they _should_ differ only in the
    default script used for conversion from ISCII to Unicode. ISCII text can contain a mix of Indic scripts
    by announcing each change between script runs. The script should be announced before the first Indic
    character appears in the ISCII text.
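    The stateful behavior described above can be sketched as a toy decoder. This is not a real
    ISCII converter. The switch byte (0xEF), the script codes, and the two letter bytes are
    given as I recall them from ISCII-91 and should be treated as assumptions to check against
    the standard; the names decode_toy_iscii, SCRIPT_BASE and TOY_OFFSETS are invented here.
    The sketch relies on the Unicode Indic blocks sharing a parallel layout, so one offset
    table serves every script.

```python
ATR = 0xEF  # assumed: byte that introduces a script switch or a display attribute

# Assumed script codes that may follow ATR, mapped to the Unicode block start.
SCRIPT_BASE = {
    0x42: 0x0900,  # Devanagari
    0x43: 0x0980,  # Bengali
    0x4B: 0x0A00,  # Gurmukhi
    0x4A: 0x0A80,  # Gujarati
    0x44: 0x0B80,  # Tamil
}

# Toy table: ISCII byte -> offset inside a Unicode Indic block (two letters only).
TOY_OFFSETS = {
    0xA4: 0x05,  # letter A
    0xB3: 0x15,  # letter KA
}


def decode_toy_iscii(data, default_script=0x42):
    """Decode bytes to str, tracking the current script as decoder state."""
    base = SCRIPT_BASE[default_script]  # the "codepage number" only picks this
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ATR:
            if i + 1 >= len(data):
                raise ValueError("truncated switch sequence")
            code = data[i + 1]
            if code in SCRIPT_BASE:
                base = SCRIPT_BASE[code]  # announced script change
            # Any other code (e.g. a display attribute) is dropped:
            # plain text has no way to carry it.
            i += 2
            continue
        if b < 0x80:
            out.append(chr(b))                      # ASCII passes through
        elif b in TOY_OFFSETS:
            out.append(chr(base + TOY_OFFSETS[b]))  # same byte, script-dependent
        else:
            out.append("\ufffd")                    # not covered by the toy table
        i += 1
    return "".join(out)
```

    With this sketch, the single byte 0xB3 decodes to U+0915 under the Devanagari default and
    to U+0995 under a Bengali default, and a switch sequence in the middle of the text changes
    the result for every byte that follows it.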

    One problem with such complex encodings and converters is that two implementations will rarely yield
    the same results, and that it is hard to document the behavior precisely.

    For completeness' sake, there are dozens of Indic "font encodings", i.e., someone has drawn a font
    that maps byte values to glyphs. These things are not interoperable at all. Avoid them.

    Summary: Use Unicode.

    Philippe Verdy wrote:
    > There are also errors in IBM ICU/Openi18n resources ...

    If there are errors, then please submit a bug report. If possible, please include references to
    authoritative material and a patch.

    Best regards,
