Re: Major Defect in Combining Classes of Tibetan Vowels

From: Philippe Verdy
Date: Wed Jun 25 2003 - 19:59:42 EDT

    On Thursday, June 26, 2003 1:04 AM, Andrew C. West wrote:

    > On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote:
    > >
    > > Peter asked:
    > >
    > > > How can things that are visually indistinguishable be lexically
    > > > different?
    > >
    > > chat (en)
    > > chat (fr)
    > And if Unicode reordered vowels in front of consonants, then we
    > wouldn't be able to distinguish :
    > chat (en)
    > chat (fr)
    > acht (de)
    > Andrew

    Such a distinction by language is futile: you are trying to add a language-specific lexical meaning that simply does not exist in Unicode, which only standardizes the *script* so that it *can* be rendered correctly independently of the actual language...

    So you need to assume a single language when interpreting an encoded string, but this is out of the scope of Unicode (which at best will define language-dependent character properties, but not language-dependent canonical equivalences).

    When Unicode defines such a canonical equivalence, the contract must be based *only* on the rendered text: if the text is rendered identically, so that it becomes impossible to determine which order was used to encode it as abstract character sequences, then all these orders should be made canonically equivalent.
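
    To make the notion concrete, here is a minimal Python sketch using the standard unicodedata module. The helper name canonically_equivalent and the sample strings are hypothetical, chosen only for illustration; the check itself (compare the NFD forms) follows the usual definition of canonical equivalence. It uses Latin combining marks rather than the Tibetan vowels under discussion, simply because their combining classes are easy to verify.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True when both strings have the same NFD (canonical) form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230 (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220 (below)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, combining class 230 (above)

# Different non-zero classes: the marks do not interact visually, so
# canonical ordering sorts them and both encodings are equivalent.
s1 = "a" + ACUTE + DOT_BELOW
s2 = "a" + DOT_BELOW + ACUTE

# Same class: the marks stack, so their order is visible in the rendering
# and normalization must not reorder them.
t1 = "a" + ACUTE + DIAERESIS
t2 = "a" + DIAERESIS + ACUTE

for mark in (ACUTE, DOT_BELOW, DIAERESIS):
    print(f"U+{ord(mark):04X} class {unicodedata.combining(mark)}")

print(canonically_equivalent(s1, s2))  # different classes
print(canonically_equivalent(t1, t2))  # same class
```

    The first pair compares equal after normalization and the second does not, which matches the principle above: an encoding order is only significant when it can change what is rendered.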

    The only exception is for abstract character properties. Normative properties MUST be language-independent (the sole exception being character transformations such as case mappings, which change the semantics of the text), but properties sometimes need to be distinct for correct processing during rendering. For example, the Math Symbol category and the Letter category influence the layout in actual renderers, notably the choice of font style, point size, or alignment. They also govern the extraction of entities sharing a common set of properties, such as the breaking rules that affect the correct rendering of text in display environments with different capabilities.
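
    As a small illustration of properties distinguishing characters that can look alike, the following Python sketch (again using only the standard unicodedata module; the helper describe and the choice of sample characters are hypothetical) compares the summation sign with the Greek capital sigma.

```python
import unicodedata

SUMMATION = "\u2211"  # N-ARY SUMMATION
SIGMA = "\u03a3"      # GREEK CAPITAL LETTER SIGMA


def describe(ch: str):
    """Return the (character name, General_Category) pair for one character."""
    return unicodedata.name(ch), unicodedata.category(ch)


for ch in (SUMMATION, SIGMA):
    name, category = describe(ch)
    print(f"U+{ord(ch):04X} {name}: {category}")

# Case mapping applies to the letter but leaves the math symbol unchanged.
print(SUMMATION.lower() == SUMMATION)
print(SIGMA.lower() == "\u03c3")
```

    The two glyphs may be close in many fonts, yet one is reported as a math symbol (Sm) and the other as an uppercase letter (Lu), and only the letter takes part in case mapping.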

    Labelling the text with extra information such as language, word semantics, or phonetic values is not part of the Unicode standard. The Unicode standard stops at the point where a text *can* be rendered with its original semantics, and this excludes any phonological, phonetic, or logical ordering analysis that can be made equally well on the rendered text or on the encoded text.

    -- Philippe.
