From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jul 09 2004 - 16:10:38 CDT
On 09/07/2004 17:06, Mark Davis wrote:
>I agree with Michael -- diacritic folding is a useful folding to add,
>independent of the UCA.
>
>Also, Peter's remark that: "And it is already covered by the Unicode
>collation algorithm and default table..." is incorrect. ...
>
Well, I think this depends on whether the stroke in characters like
U+00D8, and similar additional marks, are considered to be diacritics. I
am not sure that they are diacritics in the strict sense, and the
current DUCET mappings don't treat them as such, but John Cowan's list
does.
>... The UCA generally
>follows our decompositions in determining many primary weights, and we do
>not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I
>have felt from the beginning that it was a mistake to not be consistent in
>our decompositions -- but that is water under the bridge.] If you look at
>John's suggested file for diacritic
>folding (http://www.ccil.org/~cowan/DiacriticFolding.txt), ...
>
I have just reviewed this list and found it odd that Hebrew presentation
forms are included but Arabic ones are not. In fact, though, not only
the Hebrew presentation forms but also most of the precomposed
characters in this list are surely redundant, because the basic folding
algorithm (in http://www.unicode.org/reports/tr30/) is:
> a. Apply optional folding operations
> b. Apply canonical decomposition
> c. Repeat (a) and (b) until stable
> d. Apply composition if necessary
Step (b) will decompose not only presentation forms but also all
precomposed characters with canonical decompositions, and the resulting
combining marks will be deleted when step (a) is repeated. The
specification of the folding therefore needs to list only the combining
marks (all of them?) which are to be deleted, and those precomposed
characters which do *not* have canonical decompositions. Letters like O
with stroke presumably belong in this latter list, along with many of
the listed Cyrillic characters.
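The argument above can be sketched in Python with the standard
unicodedata module. This is only an illustration under assumptions: the
EXPLICIT_FOLDS table and the function names are hypothetical, only two
explicit entries are shown, and "combining marks" is approximated as
general category Mn, which is exactly the point left open by the "(all
of them?)" above.

```python
import unicodedata

# Hypothetical explicit table: characters with no canonical decomposition
# that the folding would have to list individually (two entries only).
EXPLICIT_FOLDS = {
    "\u00D8": "O",       # LATIN CAPITAL LETTER O WITH STROKE
    "\u0496": "\u0416",  # CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
}


def optional_fold(text):
    """Step (a): apply the explicit table, then drop nonspacing marks (Mn)."""
    out = []
    for ch in text:
        ch = EXPLICIT_FOLDS.get(ch, ch)
        if unicodedata.category(ch) != "Mn":  # assumption: Mn = "combining mark"
            out.append(ch)
    return "".join(out)


def diacritic_fold(text):
    """Steps (a) to (d) of the basic folding algorithm, as quoted from TR30."""
    while True:
        # (a) optional folding, then (b) canonical decomposition
        folded = unicodedata.normalize("NFD", optional_fold(text))
        if folded == text:  # (c) repeat until stable
            break
        text = folded
    return unicodedata.normalize("NFC", text)  # (d) compose if necessary
```

With this sketch, U+04D0 folds to U+0410 without being listed, because
the canonical decomposition exposes the breve, while U+00D8 folds only
because it is in the explicit table. A presentation form which has only
a compatibility decomposition, such as U+FE8D, is left unchanged.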
But I would suggest some caution about listing some of the Cyrillic
characters below for diacritic folding, especially those with
descenders. I note that 0429 is not folded to 0428 etc., and this is
correct because within the Cyrillic writing system these are entirely
separate characters. But the difference between these two is in fact
exactly the same descender which is removed in 0496 etc.
I am also surprised to note that no folding is given for 0419/0439. In
some ways this is desirable, because Russians do not consider this breve
to be a diacritic (and after all we would not want the dot on i to be
removed as a diacritic!). But these characters have canonical
decompositions to 0418/0438 plus breve, and the principle of canonical
equivalence and the folding algorithm (which works on decomposed
characters) more or less demand that the breve be deleted. In that case
048A/048B should also fold to 0418/0438 rather than 0419/0439.
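The decomposition facts behind this can be checked with Python's
standard unicodedata module. A small sketch: the helper name
canonical_decomposition is made up for illustration, and the results
depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def canonical_decomposition(cp):
    """Return the canonical decomposition of cp as a list of code points.

    Returns [] when the character has no decomposition, or only a
    compatibility one (those are tagged like "<isolated> 0627").
    """
    field = unicodedata.decomposition(chr(cp))
    if not field or field.startswith("<"):
        return []
    return [int(part, 16) for part in field.split()]


# SHORT I, SHCHA, ZHE WITH DESCENDER, SHORT I WITH TAIL
for cp in (0x0419, 0x0429, 0x0496, 0x048A):
    parts = " ".join("%04X" % d for d in canonical_decomposition(cp))
    print("%04X %s: %s" % (cp, unicodedata.name(chr(cp)), parts or "(none)"))
```

Here 0419 reports 0418 0306, so decomposing and removing marks
necessarily turns it into 0418, while 0429, 0496 and 048A report no
canonical decomposition and would change only if listed explicitly.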
>...
>04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
>04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
>0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
>0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
>0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE
>HOOK
>04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
>0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH
>DESCENDER
>04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
>0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH
>DESCENDER
>04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
>04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
>048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH
>TAIL
>049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH
>DESCENDER
>049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH
>VERTICAL STROKE
>049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
>04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
>04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
>04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
>04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH
>DESCENDER
>04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
>04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
>04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
>04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE
>HOOK
>048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
>04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH
>DESCENDER
>04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH
>DESCENDER
>04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
>04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
>04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER HA WITH
>DESCENDER
>04B3; 0425; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER HA WITH DESCENDER
>04F4; 0427; ; !uca #CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
>04F8; 042B; ; !uca #CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
>04EC; 042D; ; !uca #CYRILLIC CAPITAL LETTER E WITH DIAERESIS
>04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE
>04D3; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH DIAERESIS
>0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN
>0493; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH STROKE
>0495; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH MIDDLE
>HOOK
>04D7; 0435; ; !uca #CYRILLIC SMALL LETTER IE WITH BREVE
>0497; 0436; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZHE WITH
>DESCENDER
>04DD; 0436; ; !uca #CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
>0499; 0437; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZE WITH DESCENDER
>04DF; 0437; ; !uca #CYRILLIC SMALL LETTER ZE WITH DIAERESIS
>04E5; 0438; ; !uca #CYRILLIC SMALL LETTER I WITH DIAERESIS
>048B; 0439; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER SHORT I WITH TAIL
>049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
>049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL
>STROKE
>049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
>04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
>04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
>04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
>04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
>04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
>04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
>04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
>04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE
>HOOK
>048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
>04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
>04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
>04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
>04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
>04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL
>STROKE
>04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
>04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
>04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
>047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH
>TITLO
>047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
>0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
>0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
>04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH
>STROKE
>04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH
>STROKE
>04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH
>DESCENDER
>04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH
>DESCENDER
>04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH
>VERTICAL STROKE
>04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE
>WITH DESCENDER
>04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE
>WITH DESCENDER
>04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
>04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
>04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
>04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
>04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
>04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
>
>
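For anyone wanting to experiment with the quoted entries, here is a
rough sketch of reading them back in Python, including rejoining the
lines which the mailer wrapped. The function name and the tuple layout
are hypothetical, and the markers "!nfd+remove_marks" and "!uca" are
kept as opaque strings because their meaning is not defined in this
message.

```python
import re

# An entry starts with a hex code point followed by ";". Anything else
# is taken to be the tail of a wrapped line.
ENTRY_START = re.compile(r"^[0-9A-F]{4,6};")


def parse_folding_lines(lines):
    """Parse quoted, mail-wrapped lines into (source, target, flags, name)."""
    joined = []
    for raw in lines:
        line = raw.lstrip(">").strip()  # drop the quoting prefix
        if not line or line == "...":
            continue
        if ENTRY_START.match(line):
            joined.append(line)
        elif joined:
            joined[-1] += " " + line  # continuation of a wrapped entry
    entries = []
    for line in joined:
        data, _, name = line.partition("#")
        fields = [f.strip() for f in data.split(";")]
        source, target = int(fields[0], 16), int(fields[1], 16)
        flags = [f for f in fields[2:] if f]  # markers kept uninterpreted
        entries.append((source, target, flags, name.strip()))
    return entries


# A few lines copied from the quoted list above.
SAMPLE = [
    ">...",
    ">04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE",
    ">0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN",
    ">0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE",
    ">HOOK",
    ">",
]

for source, target, flags, name in parse_folding_lines(SAMPLE):
    print("%04X -> %04X %s %s" % (source, target, flags, name))
```

Treating any line that does not begin with a code point and a semicolon
as a continuation is a guess which fits the wrapping seen here. It
would need revisiting for a file with other kinds of lines.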
-- Peter Kirk peter@qaya.org (personal) peterkirk@qaya.org (work) http://www.qaya.org/