"scripting units" vs "scripting bits"

From: spir (denis.spir@free.fr)
Date: Thu Feb 25 2010 - 03:05:37 CST

  • Next message: Jon Hanna: "Re: "scripting units" vs "scripting bits""


    I read somewhere, and some time ago(*), that the Unicode concept of character matches the common sense of "character" in computing. I find this assertion rather amazing, but maybe it simply means I do not properly understand how things work. Here is an explanation of my view that you can happily criticize or correct.

    Let us say a string of codes represents a text made of characters. I guess (see below) that in most if not all legacy character sets (i.e. sets specific to one script/language, unlike Unicode) the following mapping applies:
        string <--> text
        code <--> character

    Now, let us introduce a distinction between "scripting units" & "scripting bits". The former are more or less what Unicode calls "graphemes" or "user characters"; the latter get composed into units. Bits are the atoms of the molecular units. For instance, in Unicode, the unit 'â' may be formed out of the bits 'a' and the combining variant of '^' (U+0302).
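    This unit/bit relation can be checked directly with Python's standard unicodedata module -- a small illustrative sketch, not part of the original argument (NFC composes bits into a precomposed unit, NFD decomposes the other way):

```python
import unicodedata

# Two "scripting bits": the base letter 'a' plus
# U+0302 COMBINING CIRCUMFLEX ACCENT (the combining variant of '^').
bits = "a\u0302"

# NFC normalization composes them into one "scripting unit".
unit = unicodedata.normalize("NFC", bits)

print(unit)                   # â
print(len(bits), len(unit))   # 2 1 -- two bits, one unit
print(hex(ord(unit)))         # 0xe2 (U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX)

# NFD normalization decomposes the unit back into its bits.
assert unicodedata.normalize("NFD", unit) == bits
```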

    It seems to me that in legacy character sets scripting bits simply do not exist, but I may be wrong on this. If so, codes always represent whole units (composite characters are precomposed):
        string <--> text
        code <--> character = scripting unit
    In other words, the representation matches the "idea" that a text is formed out of characters (even if some characters are special, like controls).

    In Unicode, the distinction between units & bits is mirrored in the representation:
        string <--> text
        stack <--> grapheme = scripting unit
        code <--> character = scripting bit
    The term "stack" is taken from the Unicode doc introducing the algorithm that groups codes into stacks -- I find it great ;-) because it clearly shows that a stack logically takes a single place in the string:
    "c u r a ç ã o"
    stack string (each stack occupies a single position; the combining marks 0327 and 0303 sit under their bases):
        0063 0075 0072 0061 0063 0061 006F
                            0327 0303
    grouped code string:
        (0063 0075 0072 0061 (0063 0327) (0061 0303) 006F)
    raw code string:
        (0063 0075 0072 0061 0063 0327 0061 0303 006F)
    A logical trap is that many units are formed out of a single bit (while most molecules hold several atoms ;-), so that most stacks are singletons. This fact may let us overlook the distinction, even if it is imo fundamental.
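    The grouping of a raw code string into stacks can be sketched in a few lines of Python -- a deliberately simplified version that only uses unicodedata.combining() (full grapheme segmentation per Unicode's rules handles more cases, e.g. Hangul jamo and joiner sequences):

```python
import unicodedata

def stacks(text):
    """Group a raw code string into "stacks": each base code followed by
    the combining marks (scripting bits) attached to it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks
        if groups and unicodedata.combining(ch):
            groups[-1] += ch   # attach the bit to the current stack
        else:
            groups.append(ch)  # start a new stack
    return groups

raw = "cura" + "c\u0327" + "a\u0303" + "o"   # the "curaçao" raw code string
print(len(raw))            # 9 codes
print(len(stacks(raw)))    # 7 stacks -- most of them singletons
```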

    I wonder whether I am right about legacy character sets. Are there some working like Unicode, meaning they have the concept of scripting bits? In other words, are there character sets whose codes do not represent graphemes, but characters in the sense of Unicode?

    It seems to make much sense, for instance, in the case of a syllabic script (if the script is purely phonetic, meaning there is a 1-->1 mapping of oral phoneme combinations to graphic syllables): codes may represent phonemic bits, to be combined into whole graphemes. (But there would be a need for a separator code (e.g. 0) in ambiguous cases.)


    (*) Unfortunately I lost the reference; pointer welcome.

    la vita e estrany


    This archive was generated by hypermail 2.1.5 : Thu Feb 25 2010 - 03:11:31 CST