Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 29 2010 - 23:04:26 CDT


    "karl williamson" <public@khwilliamson.com> wrote:
    > This discussion doesn't make sense to me. The original proposal to
    > encode 19DA says that there is one set of digits in New Tai Lue, but
    > there is an extra digit '1' (the one that got put at 19DA), used when
    > the other digit '1' is visually confusable with another character in the
    > script, which it resembles. That makes it sound like the two are
    > essentially used as glyph variants of each other, and are
    > interchangeable as far as the computer recognizing an input number.

    Yes, treating this digit as an exception will work for recognizing
    INPUT, but you still have a problem for output, because your library
    will need to know when to emit the variant: if you always use the
    default digit 1, you may create a string that is confusable to the
    reader, notably when the digit appears alone with no other digit.

    So you'll still need an exception that changes one or several of these
    digits 1 to the variant. Alternatively, you could decide to always use
    the variant (which causes no confusion), but I'm not sure that such
    use would be valid in the target language. There are possibly complex
    rules deciding when the variant is needed and accepted, and when the
    default form is preferable and not confusable.
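
    To make the input/output asymmetry concrete, here is a minimal
    sketch. The function names and the "lone_one" policy are hypothetical,
    invented only for illustration; the real rules for when the variant is
    appropriate are not known here. Input accepts either form of digit 1,
    while output needs an explicit policy to choose between them.

```python
NTL_ZERO = 0x19D0        # U+19D0 NEW TAI LUE DIGIT ZERO (digits run to U+19D9)
NTL_THAM_ONE = "\u19DA"  # U+19DA NEW TAI LUE THAM DIGIT ONE (the variant)


def parse_ntl(text):
    """Parse a New Tai Lue digit string, accepting either form of digit 1."""
    if not text:
        raise ValueError("empty digit string")
    value = 0
    for ch in text:
        if ch == NTL_THAM_ONE:
            digit = 1  # the input-side exception: variant counts as 1
        elif NTL_ZERO <= ord(ch) <= NTL_ZERO + 9:
            digit = ord(ch) - NTL_ZERO
        else:
            raise ValueError("not a New Tai Lue digit: %r" % ch)
        value = value * 10 + digit
    return value


def format_ntl(n, use_variant_one=lambda digits, i: False):
    """Format a non-negative integer with New Tai Lue digits.

    use_variant_one(digits, i) is a hypothetical policy hook: it decides
    whether the digit 1 at position i should be written as U+19DA.
    """
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    digits = str(n)
    out = []
    for i, d in enumerate(digits):
        if d == "1" and use_variant_one(digits, i):
            out.append(NTL_THAM_ONE)
        else:
            out.append(chr(NTL_ZERO + int(d)))
    return "".join(out)


def lone_one(digits, i):
    """Assumed example policy: use the variant only when 1 stands alone."""
    return len(digits) == 1
```

    With this sketch, parse_ntl() returns the same value for both
    spellings, but format_ntl(1) and format_ntl(1, lone_one) produce
    different strings, so the choice cannot be left to the parser alone.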

    For Arabic there are clearly two separate sets of digits, but the
    possibility of mixing them arbitrarily is still a problem for IDNA (if
    both sets are accepted), notably because most digits (except 4 to 6)
    look completely identical in the two sets. So registries will have to:
    - either accept one set and reject the other one,
    - or accept both, but only one within the same domain label, while
      also reserving the label that uses the other set (as if the two
      were canonically equivalent).
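
    The second option can be sketched as follows. This is only an
    illustration under stated assumptions: the two sets are taken to be
    U+0660..U+0669 (Arabic-Indic) and U+06F0..U+06F9 (Extended
    Arabic-Indic), and the function names are hypothetical, not part of
    any IDNA specification or registry software.

```python
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))

TO_EXTENDED = str.maketrans(ARABIC_INDIC, EXTENDED_ARABIC_INDIC)
TO_ARABIC_INDIC = str.maketrans(EXTENDED_ARABIC_INDIC, ARABIC_INDIC)


def digit_sets_used(label):
    """Return the names of the Arabic digit sets that occur in the label."""
    used = set()
    for ch in label:
        if ch in ARABIC_INDIC:
            used.add("arabic-indic")
        elif ch in EXTENDED_ARABIC_INDIC:
            used.add("extended-arabic-indic")
    return used


def accept_label(label):
    """Accept a label only if it does not mix the two digit sets."""
    return len(digit_sets_used(label)) <= 1


def reserved_variant(label):
    """Return the same label spelled with the other digit set.

    Returns None when the label has no Arabic digits or mixes both sets.
    """
    used = digit_sets_used(label)
    if len(used) != 1:
        return None
    if "arabic-indic" in used:
        return label.translate(TO_EXTENDED)
    return label.translate(TO_ARABIC_INDIC)
```

    A registry following this sketch would call accept_label() first and,
    on success, block reserved_variant(label) together with the label
    itself.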

    Such equivalences (which are definitely not canonical) can be handled
    by tailored collation comparisons (operating at collation level 2 only,
    while non-IDN registries operate only at level 1), with each IDN
    registry using its own tailoring. I just see the IDN "StringPrep" as a
    particular application of the general concept of collation mappings
    (except that it was not designed on a linguistic basis, but an IDN
    registry can be viewed as a locale for collation purposes). All these
    complex rules and mappings of IDN can be written in terms of a set of
    collation rules, added on top of the DUCET.
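
    As a toy illustration of that idea (not the UCA algorithm, and not
    real DUCET weights), a registry can be modelled as a table keyed by a
    folded "primary" key, with the original code points kept only as a
    tie-breaker. The folding table, the Registry class, and the level
    numbering below are all assumptions made for this sketch.

```python
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))

# Assumed tailoring: both digit sets fold to ASCII digits at the first level.
FOLD = str.maketrans(ARABIC_INDIC + EXTENDED_ARABIC_INDIC, "0123456789" * 2)


def primary_key(label):
    """First-level key: digit sets and letter case are not distinguished."""
    return label.translate(FOLD).casefold()


def full_key(label):
    """First-level key, then the original string as a second-level tie-breaker."""
    return (primary_key(label), label)


class Registry:
    """Toy registry that treats labels with equal primary keys as one name."""

    def __init__(self):
        self._by_primary = {}

    def register(self, label):
        key = primary_key(label)
        if key in self._by_primary:
            return False  # an equivalent label is already taken
        self._by_primary[key] = label
        return True
```

    Here registering a label implicitly reserves every other label with
    the same primary key, which is the "as if canonically equivalent"
    behaviour expressed as a comparison at the first level only.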


