L2/07-021

Date: Tue, 16 Jan 2007
Source: Mark Davis
Subject: Customary_Use Property
=================

Based on discussions on the idna-update@alvestrand.no list, it appears that we will need a property something like the following, so here is a draft for discussion at the UTC.

Property: Customary_Use=True/False

Meaning: characters that are required for the customary orthographies of modern languages. Excludes historic characters, annotation characters, astrological signs, deprecated characters, musical notation, vertical presentation forms, compatibility characters.

Draft values: True for all letter, mark, number characters and joiner controls, except for the following:

Exclude the following Scripts:
Xsux, Ugar, Xpeo, Goth, Ital, Cprt, Linb, Phnx, Khar, Phag, Glag, Shaw, Dsrt, Runr

Exclude the following blocks:
Combining_Diacritical_Marks_for_Symbols, Musical_Symbols, Ancient_Greek_Musical_Notation

Exclude the following ranges of characters (a copy from email from Ken):

Common Diacritics
omit: 0363..036F
reason: These Latin letters above are specialist medievalist usage for manuscripts, and are not a part of regular orthographies. They would also be quite confusing for internet identifiers.

Hebrew
omit: 0591..05AF, 05C4..05C5
reason: 0591..05AF are the Hebrew accent marks Cary was talking about; their major function is as cantillation marks, to help in the chanting and singing of sacred texts. 05C4..05C5 are more marks used in the annotation of Biblical text, and are not part of the regular pointing system for vowels.

Arabic
omit: 0610..0615, 06D6..06ED
reason: 0610..0615 are honorific annotations added to names in text. 06D6..06ED are annotation marks used in Koranic text, again mostly for guidance in chanting and singing sacred text. None of these are part of regular orthographies, and should not be confused with the harakat used for indicating vowels in Arabic.

Syriac
omit: 0740..074A
reason: Again, these are marks used in annotating text, and need to be distinguished from the regular vowel marks needed for the orthography. There is no need for these annotation marks for internet identifiers.

Devanagari
omit: 0953..0954
reason: These are the dubious clones of acute and grave accent marks included in the Devanagari block. While not formally deprecated, there is no obvious function for them in Devanagari, and they are otherwise easily confused with the common diacritic acute and grave accent marks.

Tibetan
omit: 0F18..0F19, 0F35, 0F37, 0F3E..0F3F, 0FC6
reason: Some of these are astrological signs, only used for special purpose markup of digits (or occasionally other signs) in Tibetan astrology. 0F35 and 0F37 are text highlighting marks; they are used like underlining. 0FC6 is a symbol diacritic, not used with regular Tibetan text.

Khmer
omit: 17D3
reason: This is a deprecated character originally intended as part of the formation of lunar date symbols. It is not used in regular text.

Mongolian
omit: 180B..180D
reason: These are the Mongolian-specific variation selectors. They get automatically removed (by an earlier rule), because they are Default_Ignorable_Code_Point. I am just cleaning up my list here to match the rules to date.

Balinese
omit: 1B6B..1B73
reason: These are combining marks only used in Balinese musical notation, rather than in regular text.

Combining Diacritical Marks Supplement
omit: 1DC0..1DC1, 1DC3
reason: 1DC0..1DC1 are editorial signs for Ancient Greek, used only in academic annotation. 1DC3 is a combining mark for Glagolitic, a historic script already omitted from the list.

CJK Symbols and Punctuation
omit: 302A..302F
reason: These are tone mark annotations only used in nonstandard annotations of Han characters or Hangul. They are not part of either standard CJK orthographies or the commonly encountered Latin transliterations for Chinese or Korean.

omit: 3031..3035, 303B..303C
reason: While these are not combining marks, they should also be omitted from the inclusions list. 3031..3035 are special character forms only appropriate for vertically-rendered text and inappropriate for internet identifiers. 303B is another vertical rendering form. And 303C is an abbreviatory symbol that happens to equate to "masu" in Japanese, but is not a part of the regular orthography of Japanese.

Combining Half Marks
omit: FE20..FE23
reason: These are compatibility half forms, used only in the mapping of certain legacy bibliographic character encodings. They are not appropriate for normal Unicode text representation.

Arabic Presentation Forms-B
omit: FE73
reason: This is another oddball compatibility character, encoded only for transcoding to some old IBM code pages, but which doesn't have any compatibility decomposition mapping, and so which didn't get filtered by the NFKC(cp) != cp criterion. It should simply be omitted by exception here because it is inappropriate for use in internet identifiers.

-- Mark
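
For discussion, the draft values can be approximated in a few lines of Python. This is only a rough sketch under several assumptions, not part of the proposal. "Letter, mark, number" is read as General_Category L*, M*, N*. "Joiner controls" is read as U+200C and U+200D. The three excluded blocks are hard-coded using the ranges from the Unicode block definitions, which are not given in the text above. Script exclusions are not implemented, because Python's standard `unicodedata` module does not expose the Script property, so the caller has to supply a predicate. The names `customary_use`, `OMITTED_RANGES`, `EXCLUDED_BLOCKS` and `script_excluded` are hypothetical, and the category data comes from whatever Unicode version ships with the interpreter, not the 2007 data.

```python
import unicodedata

# Explicit omissions copied from the list above (inclusive code point ranges).
OMITTED_RANGES = [
    (0x0363, 0x036F),                      # Common Diacritics
    (0x0591, 0x05AF), (0x05C4, 0x05C5),    # Hebrew
    (0x0610, 0x0615), (0x06D6, 0x06ED),    # Arabic
    (0x0740, 0x074A),                      # Syriac
    (0x0953, 0x0954),                      # Devanagari
    (0x0F18, 0x0F19), (0x0F35, 0x0F35),    # Tibetan
    (0x0F37, 0x0F37), (0x0F3E, 0x0F3F),
    (0x0FC6, 0x0FC6),
    (0x17D3, 0x17D3),                      # Khmer
    (0x180B, 0x180D),                      # Mongolian
    (0x1B6B, 0x1B73),                      # Balinese
    (0x1DC0, 0x1DC1), (0x1DC3, 0x1DC3),    # Combining Diacritical Marks Supplement
    (0x302A, 0x302F),                      # CJK Symbols and Punctuation
    (0x3031, 0x3035), (0x303B, 0x303C),
    (0xFE20, 0xFE23),                      # Combining Half Marks
    (0xFE73, 0xFE73),                      # Arabic Presentation Forms-B
]

# Assumption: block boundaries taken from the Unicode block definitions.
EXCLUDED_BLOCKS = [
    (0x20D0, 0x20FF),    # Combining_Diacritical_Marks_for_Symbols
    (0x1D100, 0x1D1FF),  # Musical_Symbols
    (0x1D200, 0x1D24F),  # Ancient_Greek_Musical_Notation
]

# Assumption: "joiner controls" means ZWNJ and ZWJ.
JOINER_CONTROLS = {0x200C, 0x200D}


def _in_ranges(cp, ranges):
    """Return True if cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def customary_use(cp, script_excluded=lambda cp: False):
    """Approximate the draft Customary_Use value for code point cp.

    script_excluded is a hypothetical caller-supplied predicate that
    returns True for code points in one of the excluded scripts; the
    default excludes nothing, so script exclusions are NOT applied.
    """
    if _in_ranges(cp, OMITTED_RANGES) or _in_ranges(cp, EXCLUDED_BLOCKS):
        return False
    if script_excluded(cp):
        return False
    if cp in JOINER_CONTROLS:
        return True
    # General_Category starting with L, M or N: letter, mark, number.
    return unicodedata.category(chr(cp))[0] in "LMN"
```

A caller with access to Script data would pass it in through `script_excluded`; for example, `customary_use(0x10330, lambda cp: 0x10330 <= cp <= 0x1034F)` uses the Gothic block range as a stand-in for the Goth script test. Without such a predicate, characters from the excluded scripts are still reported as True by this sketch.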