re: "scripting units" vs "scripting bits"

From: verdy_p (
Date: Thu Feb 25 2010 - 08:53:14 CST


    > From: "spir"
    > I wonder whether I am right about legacy character sets. Are there some working like Unicode, meaning they have
    > the concept of scripting bits? In other words, are there character sets whose codes do not represent graphemes,
    > but characters in the sense of Unicode?

    You are wrong: even today, many Teletext systems still broadcast on TV programs use an encoding of diacritics
    separate from the base character (in detail, these were even encoded as sequences of characters when
    transported over 7-bit streams, so that there was good compatibility with ISO 646 or the IRV).

    Look also at the various Indic charsets (the ISCII standard): viramas and so on.

    Look also at Hebrew and Arabic encodings: vowel points, consonant modifiers, cantillation marks and so on.

    The need for separate encoding of diacritics has existed for a very long time, notably when charsets were limited
    to code units no longer than a single 7-bit or 8-bit byte.
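    Unicode itself kept both models: normalization form NFD spells a letter as a base character plus a combining mark, much as those byte-limited charsets did, while NFC gives the precomposed form. A minimal sketch using only Python's standard unicodedata module:

    ```python
    import unicodedata

    # "é" can be one precomposed code point (U+00E9) or a base letter
    # plus a combining acute accent (U+0065 U+0301); the two forms are
    # canonically equivalent in Unicode.
    decomposed = unicodedata.normalize("NFD", "\u00e9")
    precomposed = unicodedata.normalize("NFC", "e\u0301")

    print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']
    print(hex(ord(precomposed)))              # '0xe9'
    ```

    Both forms render identically; only the underlying code point sequence differs.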

    Even the oldest Japanese JIS standard used a combining character (for the voiced/unvoiced consonantal modifier
    within its two syllabaries); the same was true of the oldest KSC standard for Korean jamos (before they could be
    precomposed into syllables), and this remains true in the most recent KSC standard (which uses variable-length
    multibyte sequences made of codes belonging to multiple "parallel" sub-codepages).
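    Both of these survive directly in Unicode's composition rules: the JIS-style voiced sound mark and conjoining Korean jamos each compose into precomposed characters under NFC. A small illustration in Python:

    ```python
    import unicodedata

    # Japanese: KA (U+304B) + combining voiced sound mark (U+3099)
    # composes to GA (U+304C), mirroring the old JIS combining modifier.
    ga = unicodedata.normalize("NFC", "\u304b\u3099")
    print(hex(ord(ga)))  # '0x304c'

    # Korean: conjoining jamos HIEUH + A + final NIEUN
    # (U+1112 U+1161 U+11AB) compose to the syllable HAN (U+D55C).
    han = unicodedata.normalize("NFC", "\u1112\u1161\u11ab")
    print(hex(ord(han)))  # '0xd55c'
    ```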

    The Greek ELOT standard also used separate encoding of the many diacritics needed for writing Polytonic Greek.

    Various proprietary encodings used in printer languages have also defined their own encodings for separate non-
    spacing diacritics (including for the Latin script). This was a natural evolution of the oldest sequences using
    BACKSPACE, kept for compatibility with ISO 646 in a restricted 7-bit environment: extending them to 8 bits did not
    remove these diacritics, even though they could then be encoded in a simpler way and more characters could be
    encoded with a single code.
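    The BACKSPACE overstrike trick can be sketched as a simple decoder. The mapping table below is hypothetical (real devices varied, and some conventions put the accent before the base character); it only shows how a base + BACKSPACE + accent sequence maps onto a Unicode base character plus combining mark:

    ```python
    # Hypothetical mapping from 7-bit overstrike characters to
    # Unicode combining marks (an assumption for illustration).
    OVERSTRIKE_TO_COMBINING = {
        "'": "\u0301",  # apostrophe overstruck as an acute accent
        "`": "\u0300",  # grave accent
        '"': "\u0308",  # diaeresis
        "~": "\u0303",  # tilde
    }

    def decode_overstrikes(s: str) -> str:
        """Turn base + BACKSPACE (0x08) + accent into base + combining mark."""
        out = []
        i = 0
        while i < len(s):
            if (i + 2 < len(s)
                    and s[i + 1] == "\x08"
                    and s[i + 2] in OVERSTRIKE_TO_COMBINING):
                out.append(s[i])
                out.append(OVERSTRIKE_TO_COMBINING[s[i + 2]])
                i += 3
            else:
                out.append(s[i])
                i += 1
        return "".join(out)

    print(decode_overstrikes("re\x08'sume\x08'"))  # résumé (decomposed form)
    ```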

    The period during which the ISO 8859 standards were developed and then widely adopted for Latin-script languages
    has been quite short in computing history (they still remain, but there is no longer any development on them, and
    these encodings are now rapidly fading out in favor of the Unicode/ISO/IEC 10646 UTFs).

    This archive was generated by hypermail 2.1.5 : Thu Feb 25 2010 - 08:56:58 CST