L2/07-021

Date: Tue, 16 Jan 2007
Source: Mark Davis
Subject: Customary_Use Property

=================

Based on discussions on the idna-update@alvestrand.no list, it appears that
we will need a property along the following lines, so here is a draft for
discussion at the UTC.

Property: Customary_Use=True/False

Meaning: characters that are required for the customary orthographies of
modern languages. Excludes historic characters, annotation characters,
astrological signs, deprecated characters, musical notation, vertical
presentation forms, and compatibility characters.

Draft values: True for all letter, mark, and number characters and for
joiner controls, except for the following:

Exclude the following Scripts:
Xsux, Ugar, Xpeo, Goth, Ital, Cprt, Linb, Phnx, Khar, Phag, Glag, Shaw,
Dsrt, Runr

Exclude the following blocks:
Combining_Diacritical_Marks_for_Symbols, Musical_Symbols,
Ancient_Greek_Musical_Notation
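
The draft rule up to this point can be sketched roughly as follows. This is
only an illustration, not a proposed implementation. It assumes that "letter,
mark, number characters" means General_Category L*, M*, and N*, that "joiner
controls" means U+200C and U+200D, and that the three blocks have the ranges
given in Blocks.txt. Python's standard library has no Script property, so the
script exclusion is passed in as a hypothetical predicate (in_excluded_script)
that a caller would have to supply from Scripts.txt. All function names here
are illustrative.

```python
import unicodedata

# Assumption: "joiner controls" = ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOINER_CONTROLS = {0x200C, 0x200D}

# Ranges of the three excluded blocks, as listed in Blocks.txt.
EXCLUDED_BLOCKS = [
    (0x20D0, 0x20FF),    # Combining_Diacritical_Marks_for_Symbols
    (0x1D100, 0x1D1FF),  # Musical_Symbols
    (0x1D200, 0x1D24F),  # Ancient_Greek_Musical_Notation
]


def is_base_candidate(cp):
    """True for joiner controls and for General_Category L*, M*, N*."""
    if cp in JOINER_CONTROLS:
        return True
    return unicodedata.category(chr(cp))[0] in "LMN"


def in_excluded_block(cp):
    """True if the code point lies in one of the excluded blocks."""
    return any(first <= cp <= last for first, last in EXCLUDED_BLOCKS)


def draft_customary_use(cp, in_excluded_script=lambda cp: False):
    """Base rule minus block and script exclusions.

    in_excluded_script is a hypothetical hook; by default it excludes
    nothing, because script data is not available in the stdlib.
    """
    return (
        is_base_candidate(cp)
        and not in_excluded_block(cp)
        and not in_excluded_script(cp)
    )
```

The per-range exceptions listed below would be applied as a further filter on
top of this result.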

Exclude the following ranges of characters (copied from an email from Ken):

Common Diacritics

 omit:    0363..036F

 reason:  These combining Latin letters, written above the base
          letter, are specialist medievalist usage for manuscripts,
          and are not a part of regular orthographies. They would
          also be quite confusing for internet identifiers.

Hebrew

 omit:    0591..05AF, 05C4..05C5

 reason:  0591..05AF are the Hebrew accent marks Cary was talking about;
          their major function is as cantillation marks, to help
          in the chanting and singing of sacred texts. 05C4..05C5
          are more marks used in the annotation of Biblical text,
          and are not part of the regular pointing system for vowels.

Arabic

 omit:    0610..0615, 06D6..06ED

 reason:  0610..0615 are honorific annotations added to names
          in text. 06D6..06ED are annotation marks used in Koranic
          text, again mostly for guidance in chanting and singing
          sacred text. None of these are part of regular orthographies,
          and should not be confused with the harakat used for
          indicating vowels in Arabic.

Syriac

 omit:    0740..074A

 reason:  Again, these are marks used in annotating text, and need
          to be distinguished from the regular vowel marks needed
          for the orthography. There is no need for these annotation marks
          for internet identifiers.

Devanagari

 omit:    0953..0954

 reason:  These are the dubious clones of acute and grave accent
          marks included in the Devanagari block. While not formally
          deprecated, there is no obvious function for them in
          Devanagari, and they are otherwise easily confused with
          the common diacritic acute and grave accent marks.

Tibetan

 omit:    0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6

 reason:  Some of these are astrological signs, only used for
          special-purpose markup of digits (or occasionally other
          signs) in Tibetan astrology. 0F35 and 0F37 are text
          highlighting marks; they are used like underlining. 0FC6
          is a symbol diacritic, not used with regular Tibetan text.

Khmer

 omit:    17D3

 reason:  This is a deprecated character originally intended as
          part of the formation of lunar date symbols. It is not
          used in regular text.

Mongolian

 omit:    180B..180D

 reason:  These are the Mongolian-specific variation selectors.
          They get automatically removed (by an earlier rule),
          because they are Default_Ignorable_Code_Point. I am
          just cleaning up my list here to match the rules to
          date.

Balinese

 omit:    1B6B..1B73

 reason:  These are combining marks only used in Balinese musical
          notation, rather than in regular text.

Combining Diacritical Marks Supplement

 omit:    1DC0..1DC1, 1DC3

 reason:  1DC0..1DC1 are editorial signs for Ancient Greek, used only
          in academic annotation. 1DC3 is a combining mark for
          Glagolitic, a historic script already omitted from the list.

CJK Symbols and Punctuation

 omit:    302A..302F

 reason:  These are tone mark annotations only used in nonstandard
          annotations of Han characters or Hangul. They are not
          part of either standard CJK orthographies or the commonly
          encountered Latin transliterations for Chinese or Korean.

 omit:    3031..3035, 303B..303C

 reason:  While these are not combining marks, they should also be
          omitted from the inclusions list. 3031..3035 are special
          character forms only appropriate for vertically-rendered
          text and inappropriate for internet identifiers. 303B
          is another vertical rendering form. And 303C is an
          abbreviatory symbol that happens to equate to "masu"
          in Japanese, but is not a part of the regular orthography
          of Japanese.

Combining Half Marks

 omit:    FE20..FE23

 reason:  These are compatibility half forms, used only in the
          mapping of certain legacy bibliographic character encodings.
          They are not appropriate for normal Unicode text representation.

Arabic Presentation Forms-B

 omit:    FE73

 reason:  This is another oddball compatibility character, encoded only
          for transcoding to some old IBM code pages. It has no
          compatibility decomposition mapping, and so it was not
          filtered by the NFKC(cp) != cp criterion. It should
          simply be omitted by exception here because it is
          inappropriate for use in internet identifiers.
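
As a rough sketch, the omitted ranges above can be collected into one table
and checked with a simple lookup. The ranges are copied from the list in this
document. The helper names (is_omitted, nfkc_changes) are illustrative only.
The second helper shows the NFKC(cp) != cp criterion mentioned for FE73, using
Python's unicodedata module, whose data reflects whatever Unicode version the
interpreter ships with.

```python
import unicodedata

# Omitted ranges, copied from the list above (inclusive endpoints).
OMITTED_RANGES = [
    (0x0363, 0x036F),                      # Common Diacritics
    (0x0591, 0x05AF), (0x05C4, 0x05C5),    # Hebrew
    (0x0610, 0x0615), (0x06D6, 0x06ED),    # Arabic
    (0x0740, 0x074A),                      # Syriac
    (0x0953, 0x0954),                      # Devanagari
    (0x0F18, 0x0F19), (0x0F35, 0x0F35),    # Tibetan
    (0x0F37, 0x0F37), (0x0F3E, 0x0F3F),
    (0x0FC6, 0x0FC6),
    (0x17D3, 0x17D3),                      # Khmer
    (0x180B, 0x180D),                      # Mongolian
    (0x1B6B, 0x1B73),                      # Balinese
    (0x1DC0, 0x1DC1), (0x1DC3, 0x1DC3),    # Combining Diacritical Marks Supplement
    (0x302A, 0x302F), (0x3031, 0x3035),    # CJK Symbols and Punctuation
    (0x303B, 0x303C),
    (0xFE20, 0xFE23),                      # Combining Half Marks
    (0xFE73, 0xFE73),                      # Arabic Presentation Forms-B
]


def is_omitted(cp):
    """True if the code point falls in one of the omitted ranges."""
    return any(first <= cp <= last for first, last in OMITTED_RANGES)


def nfkc_changes(cp):
    """True if NFKC normalization alters the character (NFKC(cp) != cp)."""
    s = chr(cp)
    return unicodedata.normalize("NFKC", s) != s
```

Under this sketch, FE73 is not caught by nfkc_changes, which is why it has to
appear in the exception table, whereas a neighbor such as FE70 is already
filtered by the NFKC criterion.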
-- 
Mark