From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Jul 09 2004 - 11:06:56 CDT
I agree with Michael -- diacritic folding is a useful folding to add,
independent of the UCA.
Also, Peter's remark that: "And it is already covered by the Unicode
collation algorithm and default table..." is incorrect. The UCA generally
follows our decompositions in determining many primary weights, and we do
not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I
have felt from the beginning that it was a mistake to not be consistent in
our decompositions -- but that is water under the bridge.] If you look at
John's suggested file for diacritic folding
(http://www.ccil.org/~cowan/DiacriticFolding.txt), there are quite a
number of characters that are not reflected in the UCA. Below is a filter of those
characters in his file that either:
(a) are not the same as folding to NFD and removing combining marks
    (marked "!nfd+remove_marks"), or
(b) are not primary equivalents in the UCA (marked "!uca")
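
Criterion (a) can be checked with a few lines of Python. This is only a minimal sketch: the helper name nfd_strip_marks is hypothetical, and it assumes "removing combining marks" means dropping every character whose general category starts with M after NFD normalization.

```python
import unicodedata


def nfd_strip_marks(text: str) -> str:
    """Normalize to NFD, then drop all combining marks (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# Characters with a canonical decomposition lose their diacritics:
print(nfd_strip_marks("Dvořák"))                # Dvorak
print(nfd_strip_marks("\u04D0") == "\u0410")    # True: CYRILLIC A WITH BREVE -> A

# U+00D8 and U+0141 have no decomposition, so they come back unchanged:
print(nfd_strip_marks("\u00D8") == "\u00D8")    # True
print(nfd_strip_marks("\u0141") == "\u0141")    # True
```

Characters that come back unchanged are the ones that need an explicit entry in a folding table.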
There is a proposal being worked on to change the UCA primary weights, e.g.,
to give the same primary weights to O and O WITH STROKE, but as of this
point the UCA does not fold the following cases marked "!uca". (Note that
for O and O WITH STROKE this would be the *default* UCA weight; the CLDR
already tailors O WITH STROKE above Z for a number of languages.)
============
0181; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH HOOK
0182; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH TOPBAR
0187; 0043; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER C WITH HOOK
0110; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH STROKE
018A; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH HOOK
018B; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH TOPBAR
0191; 0046; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER F WITH HOOK
0193; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH HOOK
01E4; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH STROKE
0126; 0048; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER H WITH STROKE
0197; 0049; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER I WITH STROKE
0198; 004B; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER K WITH HOOK
0141; 004C; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER L WITH STROKE
019D; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LEFT HOOK
0220; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
00D8; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE
019F; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH MIDDLE TILDE
01FE; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
01A4; 0050; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER P WITH HOOK
0166; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH STROKE
01AC; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH HOOK
01AE; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
01B2; 0056; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER V WITH HOOK
01B3; 0059; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Y WITH HOOK
01B5; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH STROKE
0224; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH HOOK
1E9A; 0061; !nfd+remove_marks; !uca #LATIN SMALL LETTER A WITH RIGHT HALF RING
0180; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH STROKE
0183; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH TOPBAR
0253; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH HOOK
0188; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH HOOK
0255; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH CURL
0111; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH STROKE
018C; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TOPBAR
0221; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH CURL
0256; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TAIL
0257; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH HOOK
0192; 0066; !nfd+remove_marks; !uca #LATIN SMALL LETTER F WITH HOOK
01E5; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH STROKE
0260; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH HOOK
0127; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH STROKE
0266; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH HOOK
0268; 0069; !nfd+remove_marks; !uca #LATIN SMALL LETTER I WITH STROKE
029D; 006A; !nfd+remove_marks; !uca #LATIN SMALL LETTER J WITH CROSSED-TAIL
0199; 006B; !nfd+remove_marks; !uca #LATIN SMALL LETTER K WITH HOOK
0140; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE DOT
0142; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH STROKE
019A; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BAR
0234; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH CURL
026B; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE TILDE
026C; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BELT
026D; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH RETROFLEX HOOK
0271; 006D; !nfd+remove_marks; !uca #LATIN SMALL LETTER M WITH HOOK
019E; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LONG RIGHT LEG
0235; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH CURL
0272; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LEFT HOOK
0273; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH RETROFLEX HOOK
00F8; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE
01FF; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE AND ACUTE
01A5; 0070; !nfd+remove_marks; !uca #LATIN SMALL LETTER P WITH HOOK
02A0; 0071; !nfd+remove_marks; !uca #LATIN SMALL LETTER Q WITH HOOK
027C; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH LONG LEG
027D; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH TAIL
0282; 0073; !nfd+remove_marks; !uca #LATIN SMALL LETTER S WITH HOOK
0167; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH STROKE
01AB; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH PALATAL HOOK
01AD; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH HOOK
0236; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH CURL
0288; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH RETROFLEX HOOK
028B; 0076; !nfd+remove_marks; !uca #LATIN SMALL LETTER V WITH HOOK
01B4; 0079; !nfd+remove_marks; !uca #LATIN SMALL LETTER Y WITH HOOK
01B6; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH STROKE
0225; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH HOOK
0290; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH RETROFLEX HOOK
0291; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH CURL
025A; 0259; !nfd+remove_marks; !uca #LATIN SMALL LETTER SCHWA WITH HOOK
0286; 0283; !nfd+remove_marks; !uca #LATIN SMALL LETTER ESH WITH CURL
01BA; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH TAIL
0293; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH CURL
04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER HA WITH DESCENDER
04B3; 0425; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER HA WITH DESCENDER
04F4; 0427; ; !uca #CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
04F8; 042B; ; !uca #CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
04EC; 042D; ; !uca #CYRILLIC CAPITAL LETTER E WITH DIAERESIS
04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE
04D3; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH DIAERESIS
0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN
0493; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH STROKE
0495; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
04D7; 0435; ; !uca #CYRILLIC SMALL LETTER IE WITH BREVE
0497; 0436; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZHE WITH DESCENDER
04DD; 0436; ; !uca #CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
0499; 0437; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZE WITH DESCENDER
04DF; 0437; ; !uca #CYRILLIC SMALL LETTER ZE WITH DIAERESIS
04E5; 0438; ; !uca #CYRILLIC SMALL LETTER I WITH DIAERESIS
048B; 0439; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER SHORT I WITH TAIL
049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
Mark
----- Original Message -----
From: "Michael (michka) Kaplan" <michka@trigeminal.com>
To: <unicode@unicode.org>
Sent: Friday, July 09, 2004 07:40
Subject: Re: Looking for transcription or transliteration standards latin->arabic
> From: "Peter Kirk" <peterkirk@qaya.org>
>
> > But Kaplan is referring to something quite different, optionally
> > ignoring diacritics in search operations. This is indeed desirable, so
> > that a single search can match both Dvorak and Dvořák for example, and
> > so that the one doing the search does not need to remember exactly which
> > diacritics are used in the name. And it is already covered by the
> > Unicode collation algorithm and default table, in which diacritics are
> > distinguished only at the second level and so folded by a top level only
> > collation.
>
> (a) If this were true and it were the only need, then case folding would
> also just be "a UCA issue", yet case folding is in the document.
>
> (b) Not everyone who uses Unicode uses the UCA (most of the corporate
> member companies in Unicode -- including IBM -- had alternate collation
> methods that existed prior to the UCA and which to this day support more
> languages, in their databases and operating systems).
>
> (c) Since the operation (diacritic folding) is a valid one that
> implementations may want to do and the UCA is a UTS and thus not required
> for Unicode conformance, it is a sensible folding operation to define.
>
> Does diacritic folding destroy information provided by the distinctions that
> diacritics provide? Of course it does. But then again, the same can be said
> of all foldings. This does not diminish their potential usefulness in
> specific tasks/operations.
>
>
> MichKa [MS]
> NLS Collation/Locale/Keyboard Development
> Globalization Infrastructure and Font Technologies
> Windows International Division