RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 11 2003 - 14:26:25 EST

Next message: Elaine Keown: "Qumran scribal, again: http://ccat.sas.upenn.edu/rs/dss/marks/review.html"

Previous message: Benjamin Peterson: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
In reply to: jon@hackcraft.net: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Next in thread: Peter Kirk: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

jon@hackcraft.net wrote:
> Beginners, even young children, can get the concept of characters
> being mapped to numbers. Certainly those young children that will
> thrive on programming will have a fascination with this process in
> and of itself (it's just like the kids-in-treehuts type cryptography
> such kids often like).
> (...)
> I don't think characters -> numbers -> bytes -> bits is
> particularly difficult as programming concepts go, or even é <=>
> e + ´ when compared to many higher-level string handling activities
> (regular expressions, bidirectional over-riding, and the subtler points
> of case operations).
>
> Even so, I think it's making those two levels meet that is the biggest
> stumbling block for beginners.

Well, if you just consider the concept of writing and of learning how to
write, the decomposition of the spoken language into words and letters, with
conventional signs to mark them, creates a second meta-language applied to
the initial spoken language. That is a very similar abstraction.

If children can learn (sometimes with difficulty) how to write and read
the language they have first learned to speak, using such decomposition
models made of collections of glyphs, themselves composed in a more or
less regular way from strokes, then we can't assume that it's illogical to
map grapheme clusters (the nearest model of the written form of languages)
into abstract characters (that's what children learn at school when they
learn orthographic rules), then into code points (similar to what they
learn when they start collating words by ordering characters with more or
less complex rules, the simplest one being as simple as counting numbers,
just because it's necessary to learn how to search in a dictionary or in a
phone directory).

Most literate people stop at this step, but computer students then go on
to learn about code units (what they learn when they start programming in
most computer languages, with completely arbitrary integer range limits),
then streamed bytes (what they learn when they need to transmit their
documents and find a way to interchange their local data).
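
To make these levels concrete, here is a minimal sketch in Python 3 (my
choice of language for illustration; the helper name utf16_code_units and
the sample character are assumptions, not anything from this thread). It
shows one abstract character seen as a code point, as UTF-16 code units,
and as streamed UTF-8 bytes:

```python
# One character outside the Basic Multilingual Plane, chosen so that the
# three levels give visibly different answers.
text = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

# Level 1: code points (Python 3 strings are indexed by code point).
code_points = [ord(ch) for ch in text]


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers.

    Hypothetical helper for illustration: it encodes to big-endian UTF-16
    and regroups the bytes two by two.
    """
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Level 2: code units (here 16-bit units; a surrogate pair is needed).
utf16_units = utf16_code_units(text)

# Level 3: streamed bytes (here the UTF-8 encoding scheme).
utf8_bytes = list(text.encode("utf-8"))

print([hex(cp) for cp in code_points])   # 1 code point
print([hex(u) for u in utf16_units])     # 2 code units
print([hex(b) for b in utf8_bytes])      # 4 bytes
```

The same text therefore has length 1, 2 or 4 depending on the level at
which it is counted.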

If there's a level which seems natural to all literate people, excluding
computer students who learn how to write computer programs, it's the level
of abstraction of code points, not code units. Thankfully, this is exactly
the main level at which Unicode and ISO 10646 work.

But it is also at that level (decomposition of grapheme clusters into
abstract characters, then into code points) that canonical equivalences and
normalization forms occur (I exclude here all considerations of code
units, including surrogates, and of streamed bytes or bits).
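
As an illustration of canonical equivalence at the code point level, here
is a small sketch using Python 3's standard unicodedata module. The
function name canonically_equivalent is hypothetical, and comparing NFD
forms is just one simple way to test equivalence, not the only one:

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two strings differ as sequences of code points...
print(precomposed == decomposed)                      # False
print(len(precomposed), len(decomposed))              # 1 2

# ...but they are canonically equivalent, and NFC recomposes the pair.
print(canonically_equivalent(precomposed, decomposed))              # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)      # True
```

A text editor that compares strings only code point by code point would
treat these two spellings of the same grapheme cluster as different.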

However, the standard C/C++ "string" handling library does not operate at
the code point level (and neither does Java's) but really in terms of code
units (whatever their effective sizes in terms of representable integer
ranges, from 1 bit to 32 bits, and even quite recently with 64-bit code
units). It was not designed to operate on code points, which is the natural
level of abstraction for written languages.
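
The practical difference can be sketched as follows. This is only a
simulation in Python 3 of what an API indexed by 16-bit code units does; the
helper names utf16_length and truncate_utf16_units are hypothetical, and no
particular C, C++ or Java function is being reproduced here:

```python
def utf16_length(s):
    """Length of s counted in 16-bit code units rather than code points."""
    return len(s.encode("utf-16-le")) // 2


def truncate_utf16_units(s, n):
    """Keep the first n UTF-16 code units of s, as a code-unit API would.

    A cut that falls inside a surrogate pair leaves a lone surrogate, which
    is decoded here with errors="replace" so that the damage is visible.
    """
    data = s.encode("utf-16-le")[: 2 * n]
    return data.decode("utf-16-le", errors="replace")


s = "a\U00010400b"  # 'a', DESERET CAPITAL LETTER LONG I, 'b'

print(len(s))            # 3 code points
print(utf16_length(s))   # 4 code units

# Cutting after 2 code units splits the supplementary character in two,
# so the result is not the first 2 code points of s.
print(truncate_utf16_units(s, 2) == s[:2])   # False
# Cutting after 3 code units happens to fall on a code point boundary.
print(truncate_utf16_units(s, 3) == s[:2])   # True
```

Lengths and indices expressed in code units only agree with code points as
long as the text stays within the range a single code unit can represent.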

This means that C/C++ or Java strings are NOT a good abstraction of Unicode
strings. Conformance to Unicode when only the code unit level is
implemented is an illusion: such computer languages were not designed to
handle Unicode strings natively. So these computer languages cannot claim
they "support and conform to Unicode".

This is not true however for JavaScript/ECMAScript, and it should not be
true for computer languages like XML, HTML and SGML, which were designed
specifically to correctly represent natural written languages.



This archive was generated by hypermail 2.1.5 : Thu Dec 11 2003 - 15:24:26 EST