Re: Looking for transcription or transliteration standards latin->arabic

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jul 09 2004 - 16:10:38 CDT

    On 09/07/2004 17:06, Mark Davis wrote:

    >I agree with Michael -- diacritic folding is a useful folding to add,
    >independent of the UCA.
    >
    >Also, Peter's remark that: "And it is already covered by the Unicode
    >collation algorithm and default table..." is incorrect. ...
    >

    Well, I think this depends on whether the stroke in characters like
    U+00D8 and similar additional marks are considered to be diacritics. I
    am not sure that they are diacritics in the strict sense, and the
    current DUCET mappings don't treat them as such, but John Cowan's list
    does treat them as such.

    >... The UCA generally
    >follows our decompositions in determining many primary weights, and we do
    >not decompose characters like U+00D8 LATIN CAPITAL LETTER O WITH STROKE. [I
    >have felt from the beginning that it was a mistake to not be consistent in
    >our decompositions -- but that is water under the bridge.] If you look at
    >John's suggested file for diacritic folding
    >(http://www.ccil.org/~cowan/DiacriticFolding.txt), ...
    >

    I have just reviewed this list and found it odd that Hebrew presentation
    forms are included but Arabic ones are not. But in fact not only the
    Hebrew presentation forms but also most of the precomposed characters
    are surely redundant in this list, because the basic folding algorithm
    (in http://www.unicode.org/reports/tr30/) is:

    > a. Apply optional folding operations
    > b. Apply canonical decomposition
    > c. Repeat (*a*) and (*b*) until stable
    > d. Apply composition if necessary

    Step (b) will decompose not only presentation forms but also all
    precomposed characters with canonical decompositions, and the combining
    marks will be deleted by the repeat of step (a). The specification of
    the folding therefore needs to list only the combining marks (all of
    them?) which are to be deleted, and those precomposed characters which
    do *not* have canonical decompositions. Letters like O with stroke
    are presumably in this latter list, along with many of the listed
    Cyrillic characters.
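
A minimal Python sketch of that loop, using the standard unicodedata module, may make the redundancy concrete. The EXPLICIT_FOLDS table and the function names are hypothetical, and treating every character of general category Mn as a deletable mark is an assumption of this sketch, not something TR30 or John's list specifies.

```python
import unicodedata

# Hypothetical explicit table: only characters WITHOUT canonical decompositions.
EXPLICIT_FOLDS = {
    "\u00D8": "O",       # LATIN CAPITAL LETTER O WITH STROKE
    "\u00F8": "o",       # LATIN SMALL LETTER O WITH STROKE
    "\u0496": "\u0416",  # CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
    "\u048A": "\u0419",  # SHORT I WITH TAIL, mapped as in the quoted list
}

def optional_fold(text):
    """Step (a): delete nonspacing marks and apply the explicit table."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":  # assumption: every Mn is deleted
            continue
        out.append(EXPLICIT_FOLDS.get(ch, ch))
    return "".join(out)

def fold_diacritics(text):
    """Steps (a)-(d): fold, decompose, repeat until stable, then compose."""
    while True:
        folded = unicodedata.normalize("NFD", optional_fold(text))  # (a), (b)
        if folded == text:                                          # (c)
            break
        text = folded
    return unicodedata.normalize("NFC", text)                       # (d)

print(fold_diacritics("\u04D0"))  # A WITH BREVE needs no table entry
print(fold_diacritics("\u048A"))  # passes through 0419 and ends at 0418
```

In this sketch 04D0 is folded without being listed, because step (b) exposes its breve, while 048A does need an entry. The same loop also carries 048A past 0419 to 0418, since 0419 itself has a canonical decomposition.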

    But I would suggest some caution about listing for diacritic folding
    some of the Cyrillic characters below, especially those with descenders.
    I note that 0429 is not folded to 0428 etc., and this is correct because
    within the Cyrillic writing system these are entirely separate
    characters. But the difference between these two is in fact exactly the
    same descender which is removed in 0496 etc.

    I am also surprised to note that no folding is given for 0419/0439. In
    some ways this is desirable, because Russians do not consider this breve
    to be a diacritic (and after all we would not want the dot on i to be
    removed as a diacritic!). But these characters have canonical
    decompositions to 0418/0438 plus breve, and the principle of canonical
    equivalence and the folding algorithm (which works on decomposed
    characters) more or less demand that the breve be deleted. In that case
    048A/048B should also fold to 0418/0438 rather than 0419/0439.
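
The decomposition facts behind this argument can be checked against the Unicode Character Database, for example with Python's unicodedata module. This is only an illustrative sketch; the helper name canonical_decomposition is hypothetical, and the results reflect whatever Unicode version the local Python build ships with.

```python
import unicodedata

def canonical_decomposition(cp):
    """Return the canonical decomposition of a code point as a list of ints.

    Returns [] when there is no decomposition, or only a compatibility
    decomposition (which unicodedata marks with a leading "<tag>").
    """
    raw = unicodedata.decomposition(chr(cp))
    if not raw or raw.startswith("<"):
        return []
    return [int(part, 16) for part in raw.split()]

for cp in (0x0419, 0x0439, 0x0429, 0x0496, 0x048A):
    parts = canonical_decomposition(cp)
    shown = " ".join(f"{p:04X}" for p in parts) or "(none)"
    print(f"{cp:04X} {unicodedata.name(chr(cp))}: {shown}")
```

Here 0419 and 0439 report a base letter plus 0306 COMBINING BREVE, while 0429, 0496 and 048A report no decomposition at all, which is why only the first pair is affected by step (b) of the folding algorithm.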

    >...
    >04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE
    >04D2; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH DIAERESIS
    >0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN
    >0492; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH STROKE
    >0494; 0413; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK
    >04D6; 0415; ; !uca #CYRILLIC CAPITAL LETTER IE WITH BREVE
    >0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER
    >04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS
    >0498; 0417; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZE WITH DESCENDER
    >04DE; 0417; ; !uca #CYRILLIC CAPITAL LETTER ZE WITH DIAERESIS
    >04E4; 0418; ; !uca #CYRILLIC CAPITAL LETTER I WITH DIAERESIS
    >048A; 0419; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER SHORT I WITH TAIL
    >049A; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH DESCENDER
    >049C; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
    >049E; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH STROKE
    >04C3; 041A; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KA WITH HOOK
    >04C5; 041B; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EL WITH TAIL
    >04CD; 041C; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EM WITH TAIL
    >04A2; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH DESCENDER
    >04C7; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH HOOK
    >04C9; 041D; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER EN WITH TAIL
    >04E6; 041E; ; !uca #CYRILLIC CAPITAL LETTER O WITH DIAERESIS
    >04A6; 041F; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK
    >048E; 0420; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ER WITH TICK
    >04AA; 0421; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ES WITH DESCENDER
    >04AC; 0422; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER TE WITH DESCENDER
    >04F0; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DIAERESIS
    >04F2; 0423; ; !uca #CYRILLIC CAPITAL LETTER U WITH DOUBLE ACUTE
    >04B2; 0425; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER HA WITH DESCENDER
    >04B3; 0425; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER HA WITH DESCENDER
    >04F4; 0427; ; !uca #CYRILLIC CAPITAL LETTER CHE WITH DIAERESIS
    >04F8; 042B; ; !uca #CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS
    >04EC; 042D; ; !uca #CYRILLIC CAPITAL LETTER E WITH DIAERESIS
    >04D1; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH BREVE
    >04D3; 0430; ; !uca #CYRILLIC SMALL LETTER A WITH DIAERESIS
    >0491; 0433; !nfd+remove_marks; #CYRILLIC SMALL LETTER GHE WITH UPTURN
    >0493; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH STROKE
    >0495; 0433; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
    >04D7; 0435; ; !uca #CYRILLIC SMALL LETTER IE WITH BREVE
    >0497; 0436; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZHE WITH DESCENDER
    >04DD; 0436; ; !uca #CYRILLIC SMALL LETTER ZHE WITH DIAERESIS
    >0499; 0437; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ZE WITH DESCENDER
    >04DF; 0437; ; !uca #CYRILLIC SMALL LETTER ZE WITH DIAERESIS
    >04E5; 0438; ; !uca #CYRILLIC SMALL LETTER I WITH DIAERESIS
    >048B; 0439; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER SHORT I WITH TAIL
    >049B; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH DESCENDER
    >049D; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE
    >049F; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH STROKE
    >04C4; 043A; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KA WITH HOOK
    >04C6; 043B; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EL WITH TAIL
    >04CE; 043C; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EM WITH TAIL
    >04A3; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH DESCENDER
    >04C8; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH HOOK
    >04CA; 043D; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER EN WITH TAIL
    >04E7; 043E; ; !uca #CYRILLIC SMALL LETTER O WITH DIAERESIS
    >04A7; 043F; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
    >048F; 0440; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ER WITH TICK
    >04AB; 0441; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ES WITH DESCENDER
    >04AD; 0442; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER TE WITH DESCENDER
    >04F1; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DIAERESIS
    >04F3; 0443; ; !uca #CYRILLIC SMALL LETTER U WITH DOUBLE ACUTE
    >04B9; 0447; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE
    >04F5; 0447; ; !uca #CYRILLIC SMALL LETTER CHE WITH DIAERESIS
    >04F9; 044B; ; !uca #CYRILLIC SMALL LETTER YERU WITH DIAERESIS
    >04ED; 044D; ; !uca #CYRILLIC SMALL LETTER E WITH DIAERESIS
    >047C; 0460; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
    >047D; 0461; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER OMEGA WITH TITLO
    >0476; 0474; ; !uca #CYRILLIC CAPITAL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
    >0477; 0475; ; !uca #CYRILLIC SMALL LETTER IZHITSA WITH DOUBLE GRAVE ACCENT
    >04B0; 04AE; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
    >04B1; 04AF; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE
    >04B6; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH DESCENDER
    >04B7; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER CHE WITH DESCENDER
    >04B8; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE
    >04BE; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
    >04BF; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
    >04CB; 04BC; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER KHAKASSIAN CHE
    >04CC; 04BC; !nfd+remove_marks; !uca #CYRILLIC SMALL LETTER KHAKASSIAN CHE
    >04DA; 04D8; ; !uca #CYRILLIC CAPITAL LETTER SCHWA WITH DIAERESIS
    >04DB; 04D9; ; !uca #CYRILLIC SMALL LETTER SCHWA WITH DIAERESIS
    >04EA; 04E8; ; !uca #CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS
    >04EB; 04E9; ; !uca #CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS
    >
    >
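
For anyone wanting to test which of the quoted entries are redundant in the sense discussed above, a rough Python sketch follows. The field layout (source; target; two flag fields; #name) is inferred from the quoted lines only, and reading an empty third field as "derivable by NFD plus mark removal" is an assumption, as are the helper names. Only four of the quoted lines are used as sample data.

```python
import unicodedata

SAMPLE = [
    "04D0; 0410; ; !uca #CYRILLIC CAPITAL LETTER A WITH BREVE",
    "0490; 0413; !nfd+remove_marks; #CYRILLIC CAPITAL LETTER GHE WITH UPTURN",
    "0496; 0416; !nfd+remove_marks; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER",
    "04DC; 0416; ; !uca #CYRILLIC CAPITAL LETTER ZHE WITH DIAERESIS",
]

def parse_entry(line):
    """Split one list line into (source, target, nfd_flag, uca_flag, name)."""
    data, _, name = line.partition("#")
    source, target, nfd_flag, uca_flag = (f.strip() for f in data.split(";"))
    return int(source, 16), int(target, 16), nfd_flag, uca_flag, name.strip()

def derivable_by_nfd(source, target):
    """True if NFD followed by deleting Mn marks already yields the target."""
    decomposed = unicodedata.normalize("NFD", chr(source))
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped == chr(target)

for line in SAMPLE:
    source, target, nfd_flag, uca_flag, name = parse_entry(line)
    status = "redundant" if derivable_by_nfd(source, target) else "must be listed"
    print(f"{source:04X} -> {target:04X}  {status:14}  {name}")
```

On these four lines the entries with an empty third field come out as redundant and the ones flagged !nfd+remove_marks do not, which is consistent with the reading of the flags assumed here.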

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

