Re: Re: Word dividers, was: proposals I wrote (and also, didn't write)

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Wed Dec 08 2004 - 09:11:22 CST


    > From: "Michael Everson"
    > > > But there is already in the pipeline a PHOENICIAN WORD SEPARATOR
    > >[...] The glyphs for
    > > > all of these seem indistinguishable, and so are the functions. The only
    > > > difference seems to be the scripts they are associated with, but
    > > > punctuation marks are supposed to be not tied to individual scripts.
    >
    > Read the proposal. It is not always a dot.
    >
    > John said:
    >
    > >We already have gobs of dots. It's one of those things: on the
    > >other hand, Unicode unifies all the Indic dandas, for example.
    >
    > Not for long, one hopes. And other Brahmic dandas are not unified.

    Why would there be too many dots in Unicode? Unicode does not encode glyphs but abstract characters, nearly independently of their glyphs. The need to encode them is justified by distinct semantics, distinct layout rules, and the need to make each encoded script coherent with itself, with appropriate character properties not wildly and abusively borrowed from other scripts that have their own rules...
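    To make the point about characters versus glyphs concrete, here is a minimal sketch that lists the identity and a few properties of several dot-like separators using Python's standard `unicodedata` module. The selection of code points and the helper name `describe` are my own illustration, not taken from any proposal, and the module exposes only some properties (not the script property, for instance). It also assumes a Python build whose Unicode database already includes these characters.

```python
import unicodedata

# Illustrative selection of dot-like separators (an assumption for this
# sketch, not a list taken from the proposals under discussion).
CANDIDATES = [
    "\u00B7",      # MIDDLE DOT
    "\u2E31",      # WORD SEPARATOR MIDDLE DOT
    "\U0001091F",  # PHOENICIAN WORD SEPARATOR
]


def describe(ch):
    """Return a few Unicode character properties for one character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
    }


for ch in CANDIDATES:
    print(describe(ch))
```

    The glyphs may look alike, but each entry is a separate abstract character with its own name and its own property record.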

    This holds with the exception of Latin/Greek/Cyrillic or Hiragana/Katakana, which have so many interactions that they share the same set of diacritics. For now these are in a block considered generic, but I really think that this genericity should not be abused, and that Unicode could possibly define more precisely which script family they apply to. For example, I see little interest in considering COMBINING DOT ABOVE useful for anything other than Greek/Cyrillic/Latin (and possibly a few other historic scripts); if another script needs a combining dot above, it should be encoded separately for that script, with its own name and its own properties.

    There are probably lots of missing properties for combining characters, notably layout interaction properties that are not accurately represented by combining classes (which accurately define only the canonical equivalences, not the significant equivalences). To me, documenting and standardizing them is part of Unicode's job. The same goes for Hangul jamos (notably the historic ones, but also the SSANG-letters), which should have additional normative properties related to their actual composition and layout.
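    As a rough illustration of what combining classes do and do not capture, the following sketch uses Python's standard `unicodedata` module. The variable names and the choice of base letter are mine. It shows that marks with different combining classes reorder under normalization, that marks sharing a class do not, and that only modern jamo sequences compose into syllables.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def nfd(s):
    return unicodedata.normalize("NFD", s)


def nfc(s):
    return unicodedata.normalize("NFC", s)


# Canonical combining classes of the marks.
ccc_above = unicodedata.combining(DOT_ABOVE)
ccc_below = unicodedata.combining(DOT_BELOW)

# Different classes: the two orders are canonically equivalent.
reorders = nfd("q" + DOT_ABOVE + DOT_BELOW) == nfd("q" + DOT_BELOW + DOT_ABOVE)

# Same class: the order is significant, so the sequences stay distinct.
same_class_equal = nfd("q" + DOT_ABOVE + DIAERESIS) == nfd("q" + DIAERESIS + DOT_ABOVE)

# Modern jamos compose into a precomposed syllable under NFC.
modern = nfc("\u1100\u1161")           # CHOSEONG KIYEOK + JUNGSEONG A
# A historic leading consonant (CHOSEONG PANSIOS) does not compose.
historic = nfc("\u1140\u1161")
# Two KIYEOK are not canonically equivalent to SSANGKIYEOK.
ssang_equal = nfc("\u1100\u1100") == nfc("\u1101")

print(ccc_above, ccc_below, reorders, same_class_equal)
print(len(modern), len(historic), ssang_equal)
```

    Nothing in these values says how the marks stack visually or how historic jamos should be laid out, which is the kind of property argued above to be missing.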


