Re: Surrogate points

From: Hans Aberg (haberg@math.su.se)
Date: Thu Feb 03 2005 - 14:27:26 CST

    Philippe Verdy <verdy_p@wanadoo.fr>

    [Off the list.]
    >> The problem with Unicode is that it seems to attempt to do too much
    >> in this category. It should focus on the character model, and leave the
    >> other things to other working groups and standards. That would remove a
    >> great deal of the controversy around it.
    >
    >At least on this, I can agree with you.
    >I think that Unicode's attempt to cover too many things in the same
    >standard will fail in the more or less long term. The Unicode standard
    >should be split into separate working domains.

    I think this is what is causing a lot of heat on the list: a combination of
    trying to do too much and the requirement of no change. A series of issues
    has then been resolved by compromises, which are not generally useful. When
    complaints naturally arrive, one has locked one's position in view of the
    no-change requirement, which produces a rather defensive, aggressive stance
    on the list.

    >Collation, for example, is not bound to the character encoding itself. I
    >think it should be moved out of the Unicode standard and worked on by
    >another group, without being bound to Unicode character properties.

    Other issues that should not be in Unicode are file encodings and process
    handling. Also, the endianness of the representation of numbers in languages
    seems to be wrong. So there seems to be a range of issues that should be
    lopped off the current Unicode.

    >I think that ISO would be a better place to work on collation, because
    >it's not a character encoding issue, but a more general issue about
    >handling linguistic data and semantics. A single solution for collation
    >will not work for all languages. I think that a more open standard,
    >based on various profiles (including Unicode's UCA as one of those
    >profiles), with more precise but more open definitions bound primarily
    >to linguistic issues, would be welcome.

    This has been discussed a bit on the LaTeX list, and it is clear that these
    language- and region-related issues are very complex. Another issue is how to
    represent dates in various localities, where the same language in different
    localities will use different conventions: for example, the Australian, UK,
    and US conventions. People may then pick between different conventions in
    their text. So if Unicode sticks its nose into those waters, it is likely to
    get in over its head.
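
    For illustration only (the sample date, the class name, and the use of
    Java's java.time API are my own choices, not anything taken from a
    standard), here is a small sketch of how one and the same date comes out
    under the Australian, UK, and US conventions:

        import java.time.LocalDate;
        import java.time.format.DateTimeFormatter;
        import java.time.format.FormatStyle;
        import java.util.Locale;

        public class DateConventions {
            public static void main(String[] args) {
                // An arbitrary sample date.
                LocalDate date = LocalDate.of(2005, 2, 3);

                // Same language (English), three localities, three conventions.
                for (String tag : new String[] { "en-AU", "en-GB", "en-US" }) {
                    DateTimeFormatter fmt = DateTimeFormatter
                            .ofLocalizedDate(FormatStyle.SHORT)
                            .withLocale(Locale.forLanguageTag(tag));
                    System.out.println(tag + ": " + date.format(fmt));
                }
            }
        }

    The exact output depends on the locale data shipped with the runtime, but
    typically the US convention puts the month first (2/3/05) while the UK and
    Australian conventions put the day first.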

    >Maybe Unicode.org
    >could become the registry agency for those profiles (for example, if the
    >registry is made part of CLDR). But UCA and Unicode's DUCET are
    >unusable as such. New collation algorithms are needed that will make
    >things simpler and more efficient, covering the large set of languages for
    >which the current algorithm is poor (imprecise or ambiguous) and
    >inefficient (slow, complex to implement).
    >
    >On the contrary, working groups on collation, categorized by linguistic
    >domain, could be created at ISO to cover several groups of languages,
    >based only on the ISO10646 character repertoire and with their own
    >sets of character properties independent of Unicode, these properties
    >becoming significant only for the covered languages.

    I am not myself a strong believer that everything has to be in the form of
    standards: a formal standard requires that the issue can be pinned down
    fairly accurately, which sets limitations.
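
    As for the collation point quoted above, here is a small sketch of my own
    (the sample words and class name are mine; it uses Java's standard Collator
    class) of how the very same strings order differently in different
    languages:

        import java.text.Collator;
        import java.util.Arrays;
        import java.util.Locale;

        public class CollationDemo {
            public static void main(String[] args) {
                // Sample words; "ängel" is Swedish for angel.
                String[] words = { "zebra", "ängel" };

                // German collation treats 'ä' essentially as 'a',
                // so "ängel" sorts before "zebra".
                Arrays.sort(words, Collator.getInstance(Locale.GERMAN));
                System.out.println("de: " + Arrays.toString(words));

                // Swedish collation places 'ä' after 'z',
                // so "zebra" sorts before "ängel".
                Arrays.sort(words, Collator.getInstance(Locale.forLanguageTag("sv-SE")));
                System.out.println("sv: " + Arrays.toString(words));
            }
        }

    Same code points, two different orders, depending on the collation data for
    each language; no single default ordering can satisfy both, which is why
    per-language tailorings are needed.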

    >Another example: the set of normative properties in Unicode needed
    >before characters can be standardized is becoming too large. This is a
    >critical issue, because it is also slowing the standardization process
    >at ISO10646. So Unicode tends to assign normative properties too early,
    >properties that will become unusable later and that will require
    >application developers to use their own non-compliant solutions (examples
    >found and needed today with Biblical Hebrew and Phoenician).

    That seems to be a problem with Unicode: by wanting to do too much, one will
    provide norms that will merely be disobeyed. This is a general problem with
    standards, not only with Unicode. Therefore, quite a few standards will never
    in effect be used.

    >Splitting the standard would help with abandoning some parts of it in favor
    >of other ones. So applications could still conform to ISO10646
    >and a reduced Unicode standard, but could adopt other standards for all
    >other domains not covered by the core Unicode standard. Doing this
    >should not require re-encoding texts. But it could really accelerate the
    >production of documents with still-unencoded scripts or characters.

    I think that Unicode should focus on providing the character set, the
    character numbering, and, in some cases, rules for combining characters. If
    the encoding issue had been handled correctly, it would have been completely
    independent of these issues. To this one can then add a series of other
    protocols for file formats and so on. No such other protocol will have truly
    universal spread, but it may serve a fairly special purpose. So it can be
    hard to know which ones will catch on.
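
    As a small illustration of that independence (a sketch of my own; the
    chosen character and class name are arbitrary): the character number, the
    code point, stays the same, while each encoding form serializes it to a
    different byte sequence.

        import java.nio.charset.StandardCharsets;

        public class EncodingForms {
            public static void main(String[] args) {
                // One abstract character, U+00E9 LATIN SMALL LETTER E WITH ACUTE.
                // Its number (code point) is fixed; the byte sequence depends
                // only on the chosen encoding form.
                String s = "\u00E9";
                print("UTF-8   ", s.getBytes(StandardCharsets.UTF_8));     // c3 a9
                print("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE));  // 00 e9
                print("UTF-16LE", s.getBytes(StandardCharsets.UTF_16LE));  // e9 00
            }

            private static void print(String label, byte[] bytes) {
                StringBuilder sb = new StringBuilder(label + ":");
                for (byte b : bytes) {
                    sb.append(String.format(" %02x", b));
                }
                System.out.println(sb);
            }
        }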

    >Finally, Unicode does not cover a domain which is very important for
    >the creation of digital text corpora: orthographies (and their
    >associated conventions). This is a place where nothing can be
    >standardized without breaking some Unicode conformance level, even
    >though standard orthographies could much more easily be developed based
    >only on the ISO10646 repertoire definition.

    This clearly belongs to the question of lexing and parsing a sequence of
    characters. Unicode should stay out of that as much as possible, I think.

    >So the real question is: are all those character properties in Unicode
    >needed or pertinent to cover all languages of the world? Unicode has no
    >right and no expertise in linguistic issues; only in encoding issues.

    I can think of more than one character set, going in different directions
    relative to Unicode: one that is optimized by having as few characters as
    possible; another, going in the opposite direction, might be more ample in
    its set of characters, perhaps having one for each language-locality
    combination that is unique. I do not think there is one set that is the
    right one; it depends on what design objectives one has.

    >Unicode claims not to be a standard for glyph processing, but it really
    >dictates how glyph processing must be done, and this causes unnecessary
    >complexity for renderers and font designers.

    I do not know so much about that, except that wrestling TeX into producing
    correct typeset output is extremely difficult in non-English languages. It
    is, in part, a question of how much traditional typesetting and rendering
    one should honor. With modern computer tools, it is perhaps best to find
    rendering techniques adapted to that medium. That will work at least for
    new, modern texts. It is harder if one has to provide correct rendering for
    historical texts; there, scholars must stick to what actually was in use.
    Those issues are extremely complex. If Unicode puts in conflicting
    requirements there, people will be forced to ignore it.

    On the LaTeX list, some folks wanted to standardize mathematical notation. I
    had to explain that mathematics has a set of conventions that mathematicians
    use, but there is no real standard. Further, if somebody were to attempt to
    standardize it, it would merely be ignored, because what dictates
    mathematical notation is a set of local traditions and the attempt to bring
    out the mathematical content. Other mathematicians have independently
    remarked on the same thing. So this gives one example where attempting to
    standardize is not so wise.

      Hans Aberg


