Re: Hebrew script in IDN (was Exemplar Characters)

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Nov 18 2005 - 14:22:56 CST

    I was a bit too brief, and should have given more context. These were
    characters that were called out in UAX#29 as word characters (some for
    language tailorings). The document was put forward for review; it did
    not propose that each of them be added to the default identifiers.

    Of the set:
    [\u0027\u002D\u002E\u003a\u00b7\u058a\u05f3\u05f4\u200c\u200d\u2010\u2019\u2027\u30a0]

    1. [·] is already in the default identifiers (xid_continue).

    2. [\- ‐ \: . ' ’ ‧] and [\u200C \u200D] are ineligible for inclusion
    in the default identifiers, since they are Pattern_Syntax characters or
    are normally invisible, respectively.

    3. [\u30a0 ׳ ״] are the only ones that would be possible additions to
    default identifiers.

    4. Apart from any change to the default identifiers, the consortium
    recommends an identifier *profile* for IDN in
    http://www.unicode.org/draft/reports/tr36/tr36.html. Any of the
    characters in #2 or #3 could be separately proposed for addition to
    that profile.

    But in any event, any submitted proposal has to make a good case that
    the characters are required, and that their addition will not cause
    problems.
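
    As a rough way to check the grouping in the numbered list above, the
    sketch below prints each character's general category and whether it
    can continue an identifier. It is only an approximation: Python's
    str.isidentifier() is used here as a stand-in for XID_Continue, the
    helper names (CANDIDATES, describe, report) are made up for this
    illustration, and the answers depend on the Unicode version bundled
    with the interpreter, not on the 2005 data discussed in this thread.
    Pattern_Syntax is not exposed by the standard library, so it is not
    tested.

```python
import unicodedata

# The 14 code points from the set quoted in the message.
CANDIDATES = [
    0x0027, 0x002D, 0x002E, 0x003A, 0x00B7, 0x058A, 0x05F3,
    0x05F4, 0x200C, 0x200D, 0x2010, 0x2019, 0x2027, 0x30A0,
]


def describe(cp):
    """Return name, general category, and an XID_Continue proxy for cp."""
    ch = chr(cp)
    return {
        "code": f"U+{cp:04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        # Assumption: "a" + ch is a valid identifier exactly when ch may
        # continue an identifier in this interpreter's Unicode data.
        "continues_identifier": ("a" + ch).isidentifier(),
    }


def report():
    """Describe every candidate character, in the order listed above."""
    return [describe(cp) for cp in CANDIDATES]


if __name__ == "__main__":
    print(f"Unicode data version: {unicodedata.unidata_version}")
    for row in report():
        flag = "yes" if row["continues_identifier"] else "no"
        print(f'{row["code"]}  {row["category"]}  {flag:3}  {row["name"]}')
```

    Characters that print "no" are the ones a proposal would have to
    argue for, either for the default identifiers or for the IDN profile.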

    Mark

    Neil Harris wrote:

    > Mark Davis wrote:
    >
    >> It is not that clear-cut. Identifiers by their nature cannot include
    >> all words and phrases valid in all languages. For IDN, for example,
    >> one can't express the perfectly reasonable English word "can't", or a
    >> word like "I.B.M.".
    >>
    >> I did introduce a proposal in March for considering the status of
    >> some word characters, which turned into a discussion in the UTC of
    >> whether to add certain items to the identifier definition.
    >>
    >> http://www.unicode.org/L2/L2005/05083-wordprops.txt
    >>
    >> I'll copy that section here for those without access:
    >>
    >> 0027 ; # Po APOSTROPHE
    >> 002D ; # Pd HYPHEN-MINUS
    >> 002E ; # Po FULL STOP
    >> 003A ; # Po COLON
    >> 00B7 ; # Po MIDDLE DOT
    >> 058A ; # Pd ARMENIAN HYPHEN
    >> 05F3 ; # Po HEBREW PUNCTUATION GERESH
    >> 05F4 ; # Po HEBREW PUNCTUATION GERSHAYIM
    >> 200C ; # Cf ZERO WIDTH NON-JOINER // for Indic?
    >> 200D ; # Cf ZERO WIDTH JOINER // for Indic?
    >> 2010 ; # Pd HYPHEN
    >> 2019 ; # Pf RIGHT SINGLE QUOTATION MARK
    >> 2027 ; # Po HYPHENATION POINT
    >> 30A0 ; # Pd KATAKANA-HIRAGANA DOUBLE HYPHEN
    >>
    >>
    >> The UTC decided against adding them to the identifier
    >> definition. If we were to change that for the Hebrew punctuation, we
    >> would have to see a documented case for it.
    >>
    >> Mark
    >>
    >
    > Mark,
    >
    > I think you might meet some opposition to including the following in
    > IDNs:
    >
    > APOSTROPHE (?protocol character)
    > FULL STOP (it's a label separator: so no chance for use in IDN labels)
    > COLON (a definite protocol character in URLs)
    > ZWNJ and ZWJ (unless Indic experts can make a _very_ good case for
    > these being used only in contexts where they cause _visible_ and
    > _unambiguous_ rendering changes)
    > RIGHT SINGLE QUOTATION MARK (spoof of APOSTROPHE)
    > HYPHENATION POINT (spoof of MIDDLE DOT)
    > KATAKANA-HIRAGANA DOUBLE HYPHEN (spoof of EQUALS SIGN, ?protocol
    > character)
    >
    > which leaves only
    >
    > 00B7 ; # Po MIDDLE DOT
    > 058A ; # Pd ARMENIAN HYPHEN
    > 05F3 ; # Po HEBREW PUNCTUATION GERESH
    > 05F4 ; # Po HEBREW PUNCTUATION GERSHAYIM
    >
    > as characters which I would consider possibly uncontroversial
    > candidates for IDN.
    >
    > -- Neil
    >


