Re: Looking for transcription or transliteration standards latin->arabic

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Jul 09 2004 - 11:06:56 CDT


    I agree with Michael -- diacritic folding is a useful folding to add,
    independent of the UCA.

    Also, Peter's remark that: "And it is already covered by the Unicode
    collation algorithm and default table..." is incorrect. The UCA generally
    follows our decompositions in determining many primary weights, and we do
    not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I
    have felt from the beginning that it was a mistake to not be consistent in
    our decompositions -- but that is water under the bridge.] If you look at
    John's suggested file for diacritic folding
    (http://www.ccil.org/~cowan/DiacriticFolding.txt), there are quite a number
    of foldings that are not reflected in the UCA. Below is a filter of those
    characters in his file that either:

    (a) are not the same as folding to NFD & removing combining marks
        (marked "!nfd+remove_marks")
    (b) are not primary equivalents in the UCA (marked "!uca")
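
    Test (a) can be sketched roughly as follows. This is only an illustration
    using Python's standard unicodedata module, not the tool that produced the
    listing; the function name nfd_remove_marks is made up for the example, and
    "combining marks" is assumed to mean any character whose general category
    starts with "M".

```python
import unicodedata

def nfd_remove_marks(text):
    """Normalize to NFD, then drop every combining mark (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

# A precomposed letter with a canonical decomposition folds to its base letter.
print(nfd_remove_marks("Dvo\u0159\u00E1k"))

# U+00D8 LATIN CAPITAL LETTER O WITH STROKE has no decomposition at all,
# so this folding leaves it unchanged instead of producing "O".
print(repr(unicodedata.decomposition("\u00D8")))
print(nfd_remove_marks("\u00D8") == "O")
```

    Characters such as U+04D0 CYRILLIC CAPITAL LETTER A WITH BREVE do decompose,
    which is why their rows carry no "!nfd+remove_marks" flag.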

    There is a proposal being worked on to change the UCA primary weights, e.g.,
    to give the same primary weights to O and O WITH STROKE, but as of this
    point the UCA does not fold the following cases marked "!uca". (Note that
    for O and O WITH STROKE this would be the *default* UCA weight; the CLDR
    already tailors O WITH STROKE above Z for a number of languages.)

    ============

    0181; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH HOOK
    0182; 0042; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER B WITH TOPBAR
    0187; 0043; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER C WITH HOOK
    0110; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH STROKE
    018A; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH HOOK
    018B; 0044; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER D WITH TOPBAR
    0191; 0046; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER F WITH HOOK
    0193; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH HOOK
    01E4; 0047; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER G WITH STROKE
    0126; 0048; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER H WITH STROKE
    0197; 0049; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER I WITH STROKE
    0198; 004B; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER K WITH HOOK
    0141; 004C; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER L WITH STROKE
    019D; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LEFT HOOK
    0220; 004E; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
    00D8; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE
    019F; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH MIDDLE TILDE
    01FE; 004F; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
    01A4; 0050; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER P WITH HOOK
    0166; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH STROKE
    01AC; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH HOOK
    01AE; 0054; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
    01B2; 0056; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER V WITH HOOK
    01B3; 0059; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Y WITH HOOK
    01B5; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH STROKE
    0224; 005A; !nfd+remove_marks; !uca #LATIN CAPITAL LETTER Z WITH HOOK
    1E9A; 0061; !nfd+remove_marks; !uca #LATIN SMALL LETTER A WITH RIGHT HALF RING
    0180; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH STROKE
    0183; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH TOPBAR
    0253; 0062; !nfd+remove_marks; !uca #LATIN SMALL LETTER B WITH HOOK
    0188; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH HOOK
    0255; 0063; !nfd+remove_marks; !uca #LATIN SMALL LETTER C WITH CURL
    0111; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH STROKE
    018C; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TOPBAR
    0221; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH CURL
    0256; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH TAIL
    0257; 0064; !nfd+remove_marks; !uca #LATIN SMALL LETTER D WITH HOOK
    0192; 0066; !nfd+remove_marks; !uca #LATIN SMALL LETTER F WITH HOOK
    01E5; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH STROKE
    0260; 0067; !nfd+remove_marks; !uca #LATIN SMALL LETTER G WITH HOOK
    0127; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH STROKE
    0266; 0068; !nfd+remove_marks; !uca #LATIN SMALL LETTER H WITH HOOK
    0268; 0069; !nfd+remove_marks; !uca #LATIN SMALL LETTER I WITH STROKE
    029D; 006A; !nfd+remove_marks; !uca #LATIN SMALL LETTER J WITH CROSSED-TAIL
    0199; 006B; !nfd+remove_marks; !uca #LATIN SMALL LETTER K WITH HOOK
    0140; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE DOT
    0142; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH STROKE
    019A; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BAR
    0234; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH CURL
    026B; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH MIDDLE TILDE
    026C; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH BELT
    026D; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH RETROFLEX HOOK
    0271; 006D; !nfd+remove_marks; !uca #LATIN SMALL LETTER M WITH HOOK
    019E; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LONG RIGHT LEG
    0235; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH CURL
    0272; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH LEFT HOOK
    0273; 006E; !nfd+remove_marks; !uca #LATIN SMALL LETTER N WITH RETROFLEX HOOK
    00F8; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE
    01FF; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE AND ACUTE
    01A5; 0070; !nfd+remove_marks; !uca #LATIN SMALL LETTER P WITH HOOK
    02A0; 0071; !nfd+remove_marks; !uca #LATIN SMALL LETTER Q WITH HOOK
    027C; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH LONG LEG
    027D; 0072; !nfd+remove_marks; !uca #LATIN SMALL LETTER R WITH TAIL
    0282; 0073; !nfd+remove_marks; !uca #LATIN SMALL LETTER S WITH HOOK
    0167; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH STROKE
    01AB; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH PALATAL HOOK
    01AD; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH HOOK
    0236; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH CURL
    0288; 0074; !nfd+remove_marks; !uca #LATIN SMALL LETTER T WITH RETROFLEX HOOK
    028B; 0076; !nfd+remove_marks; !uca #LATIN SMALL LETTER V WITH HOOK
    01B4; 0079; !nfd+remove_marks; !uca #LATIN SMALL LETTER Y WITH HOOK
    01B6; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH STROKE
    0225; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH HOOK
    0290; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH RETROFLEX HOOK
    0291; 007A; !nfd+remove_marks; !uca #LATIN SMALL LETTER Z WITH CURL
    025A; 0259; !nfd+remove_marks; !uca #LATIN SMALL LETTER SCHWA WITH HOOK
    0286; 0283; !nfd+remove_marks; !uca #LATIN SMALL LETTER ESH WITH CURL
    01BA; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH TAIL
    0293; 0292; !nfd+remove_marks; !uca #LATIN SMALL LETTER EZH WITH CURL
    04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
    04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
    0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
    0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
    0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
    04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
    0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
    04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
    0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
    04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
    04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
    048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
    049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
    049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
    049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
    04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
    04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
    04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
    04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
    04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
    04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
    04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
    04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
    048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
    04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
    04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
    04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
    04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
    04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER HA WITH DESCENDER
    04B3; 0425; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER HA WITH DESCENDER
    04F4; 0427; ; !uca #CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
    04F8; 042B; ; !uca #CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
    04EC; 042D; ; !uca #CYRILLIC CAPITAL LETTER E WITH DIAERESIS
    04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE
    04D3; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH DIAERESIS
    0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN
    0493; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH STROKE
    0495; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
    04D7; 0435; ; !uca #CYRILLIC SMALL LETTER IE WITH BREVE
    0497; 0436; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZHE WITH DESCENDER
    04DD; 0436; ; !uca #CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
    0499; 0437; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZE WITH DESCENDER
    04DF; 0437; ; !uca #CYRILLIC SMALL LETTER ZE WITH DIAERESIS
    04E5; 0438; ; !uca #CYRILLIC SMALL LETTER I WITH DIAERESIS
    048B; 0439; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER SHORT I WITH TAIL
    049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
    049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
    049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
    04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
    04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
    04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
    04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
    04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
    04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
    04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
    04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
    048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
    04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
    04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
    04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
    04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
    04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
    04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
    04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
    04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
    047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
    047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
    0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
    0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
    04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
    04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
    04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
    04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
    04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
    04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
    04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
    04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
    04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
    04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
    04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
    04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
    04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
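
    For anyone wanting to experiment with the listing above, a rough sketch of
    reading it into a folding table follows. It assumes each record sits on one
    physical line (rows wrapped by a mail client would need rejoining first) and
    uses the field layout seen in the listing: source code point, target code
    point, two optional flags, then "#" and the character name. The helper names
    parse_folding_line and build_translation_table are invented for this
    example.

```python
def parse_folding_line(line):
    """Split one listing row into (source, target, flags, name)."""
    data, _, name = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    source = chr(int(fields[0], 16))
    target = chr(int(fields[1], 16))
    # Empty fields mean the corresponding flag is absent for this row.
    flags = {field for field in fields[2:] if field}
    return source, target, flags, name.strip()

# A few rows copied from the listing, one record per line.
SAMPLE_ROWS = [
    "00F8; 006F; !nfd+remove_marks; !uca #LATIN SMALL LETTER O WITH STROKE",
    "0142; 006C; !nfd+remove_marks; !uca #LATIN SMALL LETTER L WITH STROKE",
    "04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE",
    "0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN",
]

def build_translation_table(rows):
    """Map each source code point to its folded target, for str.translate."""
    table = {}
    for row in rows:
        source, target, _flags, _name = parse_folding_line(row)
        table[ord(source)] = target
    return table

FOLD = build_translation_table(SAMPLE_ROWS)
print("s\u00F8ren".translate(FOLD))
```

    Keeping the flags as a set makes it easy to select, say, only the rows
    marked "!uca" when comparing against a collation-based folding.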

    Mark

    ----- Original Message -----
    From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    To: <unicode@unicode.org>
    Sent: Friday, July 09, 2004 07:40
    Subject: Re: Looking for transcription or transliteration standards latin->arabic

    > From: "Peter Kirk" <peterkirk@qaya.org>
    >
    > > But Kaplan is referring to something quite different, optionally
    > > ignoring diacritics in search operations. This is indeed desirable, so
    > > that a single search can match both Dvorak and Dvořák for example, and
    > > so that the one doing the search does not need to remember exactly which
    > > diacritics are used in the name. And it is already covered by the
    > > Unicode collation algorithm and default table, in which diacritics are
    > > distinguished only at the second level and so folded by a top level only
    > > collation.
    >
    > (a) If this were true and it were the only need, then case folding would
    > also just be "a UCA issue", yet case folding is in the document.
    >
    > (b) Not everyone who uses Unicode uses the UCA (most of the corporate
    > member companies in Unicode -- including IBM -- had alternate collation
    > methods that existed prior to the UCA and which to this day support more
    > languages, in their databases and operating systems).
    >
    > (c) Since the operation (diacritic folding) is a valid one that
    > implementations may want to do and the UCA is a UTS and thus not required
    > for Unicode conformance, it is a sensible folding operation to define.
    >
    > Does diacritic folding destroy information provided by the distinctions
    > that diacritics provide? Of course it does. But then again, the same can
    > be said of all foldings. This does not diminish their potential
    > usefulness in specific tasks/operations.
    >
    >
    > MichKa [MS]
    > NLS Collation/Locale/Keyboard Development
    > Globalization Infrastructure and Font Technologies
    > Windows International Division
    >
    >
    >


