From: Mark Davis (email@example.com)
Date: Fri Nov 18 2005 - 14:22:56 CST
I was a bit too brief, and should have given more context. These were
characters that were called out in UAX#29 as word characters (some for
language tailorings). The document was for review, not proposing that
each of them be added to default identifiers.
Of the set:
1. [·] is already in the default identifiers (xid_continue).
2. [\- ‐ \: . ' ’ ‧] and [\u200C \u200D] are ineligible for inclusion
in the default identifiers, since they are in pattern-syntax or are
normally invisible, resp.
3. [\u30a0 ׳ ״] are the only ones that would be possible additions to
the default identifiers.
4. Apart from possible addition to the default identifiers, the
consortium does recommend an identifier *profile* for IDN in
http://www.unicode.org/draft/reports/tr36/tr36.html. Any of #2 or #3
could be separately proposed for addition to that profile.
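
For anyone who wants to look at the set programmatically, here is a
minimal sketch in Python. It is not part of the review document. It
assumes that Python's str.isidentifier(), which follows the
XID_Start/XID_Continue properties of whatever Unicode version the
interpreter bundles, is an acceptable stand-in for "default
identifiers". That version is much newer than the one under discussion
here, so results for some characters (ZWNJ and ZWJ in particular) may
not match the situation described above. The names CANDIDATES,
describe and report are made up for the illustration.

```python
import unicodedata

# Code points called out as word characters in the discussion (hex).
CANDIDATES = [
    "0027", "002D", "002E", "003A", "00B7", "058A", "05F3",
    "05F4", "200C", "200D", "2010", "2019", "2027", "30A0",
]


def describe(hex_cp):
    """Return general category, name and an identifier check for one code point."""
    ch = chr(int(hex_cp, 16))
    return {
        "cp": hex_cp,
        "gc": unicodedata.category(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        # Assumption: "a" + ch is a valid Python identifier only if the
        # interpreter's Unicode data treats ch as an XID_Continue character.
        "continues_identifier": ("a" + ch).isidentifier(),
    }


def report():
    """Describe every candidate, in the order listed above."""
    return [describe(cp) for cp in CANDIDATES]


if __name__ == "__main__":
    for row in report():
        flag = "yes" if row["continues_identifier"] else "no"
        print(f'{row["cp"]} ; {row["gc"]} ; {flag:3} ; {row["name"]}')
```

Run as a script, it prints one line per code point with its general
category and whether the interpreter accepts it as an identifier
continuation character. It says nothing about pattern-syntax or about
the IDN profile, which would need the corresponding property data.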
But in any event, any submitted proposal has to make a good case that
the characters are required, and that their addition will not cause
Neil Harris wrote:
> Mark Davis wrote:
>> It is not that clear-cut. Identifiers by their nature cannot include
>> all words and phrases valid in all languages. For IDN, for example,
>> one can't express the perfectly reasonable English word "can't", or a
>> word like "I.B.M.".
>> I did introduce a proposal in March for considering the status of
>> some word characters, which turned into a discussion in the UTC of
>> whether to add certain items to the identifier definition.
>> (I'll copy that section here for those without access:
>> 0027 ; # Po APOSTROPHE
>> 002D ; # Pd HYPHEN-MINUS
>> 002E ; # Po FULL STOP
>> 003A ; # Po COLON
>> 00B7 ; # Po MIDDLE DOT
>> 058A ; # Pd ARMENIAN HYPHEN
>> 05F3 ; # Po HEBREW PUNCTUATION GERESH
>> 05F4 ; # Po HEBREW PUNCTUATION GERSHAYIM
>> 200C ; # Cf ZERO WIDTH NON-JOINER // for Indic?
>> 200D ; # Cf ZERO WIDTH JOINER // for Indic?
>> 2010 ; # Pd HYPHEN
>> 2019 ; # Pf RIGHT SINGLE QUOTATION MARK
>> 2027 ; # Po HYPHENATION POINT
>> 30A0 ; # Pd KATAKANA-HIRAGANA DOUBLE HYPHEN
>> The UTC decided against adding them to the identifier
>> definition. If we were to change that for the Hebrew punctuation, we
>> would have to see a documented case for it.
> I think you might meet some opposition to including the following in
> APOSTROPHE (?protocol character)
> FULL STOP (it's a label separator: so no chance for use in IDN labels)
> COLON (a definite protocol character in URLs)
> ZWNJ and ZWJ (unless Indic experts can make a _very_ good case for
> these being used only in contexts where they cause _visible_ and
> _unambiguous_ rendering changes)
> RIGHT SINGLE QUOTATION MARK (spoof of APOSTROPHE)
> HYPHENATION POINT (spoof of MIDDLE DOT)
> KATAKANA-HIRAGANA DOUBLE HYPHEN (spoof of EQUALS SIGN, ?protocol
> which leaves only
> 00B7 ; # Po MIDDLE DOT
> 058A ; # Pd ARMENIAN HYPHEN
> 05F3 ; # Po HEBREW PUNCTUATION GERESH
> 05F4 ; # Po HEBREW PUNCTUATION GERSHAYIM
> as characters which I would consider possibly uncontroversial
> candidates for IDN.
> -- Neil