RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 11 2003 - 14:26:25 EST


    jon@hackcraft.net wrote:
    > Beginners, even young children, can get the concept of characters
    > being mapped to numbers. Certainly those young children that will
    > thrive on programming will have a fascination with this process in
    > and of itself (it's just like the kids-in-treehuts type cryptography
    > such kids often like).
    > (...)
    > I don't think characters -> numbers -> bytes -> bits is
    > particularly difficult as programming concepts go, or even é <=>
    > e + ´ when compared to many higher-level string handling activities
    > (regular expressions, bidirectional over-riding, and the subtler points
    > of case operations).
    >
    > Even so, I think it's making those two levels meet that is the biggest
    > stumbling block for beginners.

    Well, if you consider how writing is learned, decomposing the spoken
    language into words and letters, with conventional signs to mark them
    (which creates a second meta-language applied to the initial spoken
    language), is a very similar abstraction.

    If children can learn (sometimes with difficulty) how to write and read
    the language they first learned to speak, using such decomposition
    models made of collections of glyphs, themselves composed in a more or
    less regular way from strokes, we can't assume that it's illogical to
    map grapheme clusters (the nearest model of the written form of languages)
    into abstract characters (that's what children learn at school when they
    learn orthographic rules), then into code points (similar to what they
    learn when they start collating words by ordering characters with more or
    less complex rules, the simplest one being as simple as counting numbers,
    just because it's necessary to learn how to search in a dictionary or in
    a phone directory).

    Most literate people stop at this last step, but computer students then
    learn about code units (what they learn when they start programming in
    most computer languages, with their completely arbitrary integer range
    limits), then streamed bytes (what they learn when they need to transmit
    their documents and find a way to interchange their local data).
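
    As a rough illustration of these levels (my own sketch in Python, not
    taken from any standard text; the sample string and variable names are
    arbitrary), the same short text can be counted as code points, as
    UTF-16 code units, or as UTF-8 bytes, and the three counts differ:

```python
# Illustrative sketch: one string viewed at three levels of abstraction.
# The sample text is an assumption chosen for the example:
# 'e' + COMBINING ACUTE ACCENT (U+0301) + MUSICAL SYMBOL G CLEF (U+1D11E).
text = "e\u0301\U0001D11E"

# Level 1: code points. A Python 3 str iterates code point by code point.
code_points = [ord(ch) for ch in text]

# Level 2: 16-bit code units, obtained by encoding to UTF-16 (big-endian,
# no byte order mark) and regrouping the bytes two by two.
utf16 = text.encode("utf-16-be")
utf16_units = [int.from_bytes(utf16[i:i + 2], "big")
               for i in range(0, len(utf16), 2)]

# Level 3: streamed bytes, here in the UTF-8 encoding scheme.
utf8_bytes = list(text.encode("utf-8"))

print("code points :", len(code_points), [hex(cp) for cp in code_points])
print("UTF-16 units:", len(utf16_units), [hex(u) for u in utf16_units])
print("UTF-8 bytes :", len(utf8_bytes), [hex(b) for b in utf8_bytes])
```

    The supplementary character takes one code point but two UTF-16 code
    units (a surrogate pair) and four UTF-8 bytes, which is why the lower
    levels are harder to reason about than the code point level.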

    If there's a level that seems natural to all literate people, even those
    who never learn how to write computer programs, it's the level of
    abstraction of code points, not code units. Thankfully, this is exactly
    the main level at which Unicode and ISO 10646 work.

    But it is also at that level (decomposition of grapheme clusters into
    abstract characters, then into code points) that canonical equivalences
    and normalization forms occur (I exclude here all considerations of code
    units, including surrogates, and of streamed bytes or bits).
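
    To make the é <=> e + ´ case concrete, here is a minimal sketch using
    Python's standard unicodedata module. The helper name
    canonically_equivalent is my own invention for the example, and
    comparing NFD forms is only one simple way to test equivalence:

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Compared code point by code point, the two strings differ...
print(precomposed == decomposed)

# ...but each normalization form maps both spellings to a single one.
nfc = unicodedata.normalize("NFC", decomposed)   # composes to U+00E9
nfd = unicodedata.normalize("NFD", precomposed)  # decomposes to e + U+0301


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(canonically_equivalent(precomposed, decomposed))
```

    Nothing in this comparison needs to know about code units or bytes:
    the whole operation is defined on sequences of code points.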

    However, the standard C/C++ "string" handling library does not operate at
    the code point level (and neither does Java's), but really in terms of
    code units (whatever their effective sizes in terms of representable
    integer ranges, from 1 bit to 32 bits, and even quite recently 64-bit
    code units). It was not designed to operate on code points, which is the
    natural level of abstraction for written languages.
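
    A small sketch of what goes wrong at the code unit level, written in
    Python only to simulate what a 16-bit code unit string does (the
    function names below are hypothetical, made up for this example):
    truncating by code unit count can cut a surrogate pair in half, which
    can never happen when counting code points.

```python
def to_utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def truncate_units_naive(units: list, n: int) -> list:
    """Keep the first n code units, with no regard for surrogate pairs."""
    return units[:n]


def truncate_units_safe(units: list, n: int) -> list:
    """Keep at most n code units, never ending in the middle of a pair."""
    cut = units[:n]
    if (cut and is_high_surrogate(cut[-1])
            and n < len(units) and is_low_surrogate(units[n])):
        cut = cut[:-1]  # drop the orphaned high surrogate
    return cut


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, a surrogate pair in UTF-16), 'b'
units = to_utf16_units("a\U0001D11Eb")
naive = truncate_units_naive(units, 2)  # ends with a lone high surrogate
safe = truncate_units_safe(units, 2)    # only the code unit for 'a' remains
print([hex(u) for u in naive], [hex(u) for u in safe])
```

    The extra check in truncate_units_safe is exactly the kind of work a
    code unit based string library leaves to every caller.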

    This means that C/C++ or Java strings are NOT a good abstraction of
    Unicode strings. Conformance to Unicode when only the code unit level is
    implemented is an illusion: such computer languages were not designed to
    handle Unicode strings natively. So these computer languages cannot claim
    that they "support and conform to Unicode".

    This is not true, however, for JavaScript/ECMAScript, and it should not
    be true for computer languages like XML, HTML and SGML, which were
    designed specifically to correctly represent natural written languages.

    This archive was generated by hypermail 2.1.5 : Thu Dec 11 2003 - 15:24:26 EST