Telugu U+0C48, UAX#15 decomposition/canonical recomposition, and identifier-start exclusions

From: Philippe Verdy (
Date: Sat Nov 22 2003 - 21:43:26 EST

    I note that the following character is the only one in the 4.0.1 UCD that
    is a combining character of class 0 and is nevertheless canonically
    decomposable without being excluded from recomposition:

    <0C48;TELUGU VOWEL SIGN AI;Mn;0;NSM;0C46 0C56;;;;N;;;;;>

    Note that its Bidi class is also correctly NSM (non-spacing mark). Its
    full canonical decomposition consists of these two non-spacing marks:

    <0C46;TELUGU VOWEL SIGN E;Mn;0;NSM;;;;;N;;;;;>
    <0C56;TELUGU AI LENGTH MARK;Mn;91;NSM;;;;;N;;;;;>
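
    The three records above can be checked against whatever UCD version a
    Python runtime ships. This is only a minimal sketch: describe() is a
    helper name made up for illustration, and the values printed come from
    the local unicodedata module, not necessarily from the 4.0.1 files.

```python
import unicodedata

AI = "\u0c48"           # TELUGU VOWEL SIGN AI
E = "\u0c46"            # TELUGU VOWEL SIGN E
LENGTH_MARK = "\u0c56"  # TELUGU AI LENGTH MARK


def describe(ch):
    """Return (general category, combining class, decomposition mapping).

    Hypothetical helper for illustration; it only wraps unicodedata lookups.
    """
    return (
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.decomposition(ch),
    )


for ch in (AI, E, LENGTH_MARK):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {describe(ch)}")

# NFD splits the precomposed sign; NFC puts the pair back together.
nfd = unicodedata.normalize("NFD", AI)
nfc = unicodedata.normalize("NFC", E + LENGTH_MARK)
print([f"U+{ord(c):04X}" for c in nfd], [f"U+{ord(c):04X}" for c in nfc])
```

    The round trip is the point: the decomposed pair recomposes to U+0C48
    because that character is not in the composition exclusion list.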

    The <0C48;TELUGU VOWEL SIGN AI;Mn;0;...> character should probably have
    been excluded from composition, but that is now impossible because the
    normalization forms must remain stable.
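
    One way to look for other characters of this kind is a brute-force scan.
    The sketch below assumes that "combining character of class 0" means
    gc=Mn with ccc=0, and that "not excluded" can be tested by checking that
    NFC of the decomposed form gives the character back. The function name
    is made up, and the result depends on the UCD version of the Python
    runtime, so it may differ from what 4.0.1 contains.

```python
import sys
import unicodedata


def recomposable_class0_marks(categories=("Mn",)):
    """Return code points that have one of the given general categories,
    ccc=0, a canonical decomposition, and that NFC still recomposes.

    Hypothetical helper; the default category filter is an assumption.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters with no mapping or a compatibility (<tag>) mapping.
        if not mapping or mapping.startswith("<"):
            continue
        if unicodedata.category(ch) not in categories:
            continue
        if unicodedata.combining(ch) != 0:
            continue
        # A character that survives NFD followed by NFC is recomposable.
        decomposed = unicodedata.normalize("NFD", ch)
        if unicodedata.normalize("NFC", decomposed) == ch:
            found.append(cp)
    return found


hits = recomposable_class0_marks()
for cp in hits:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

    Passing categories=("Mn", "Mc") widens the scan to spacing marks, which
    gives a longer list since several Indic two-part vowel signs are Mc.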

    Also, I can't figure out why Annex 7 of UAX#15 (normalization) does not
    list these two canonical-starter vowel signs as <identifier_extend> instead
    of <identifier_start>, along with the four other combining-like letters it
    indicates, such as <0EB3;LAO VOWEL SIGN AM;Lo;0;...>.
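
    For comparison, here is a rough sketch of how a purely category-based
    rule would sort these characters. The two category sets are a
    simplification assumed for illustration, not the text of Annex 7, and
    identifier_class() is a made-up helper name.

```python
import unicodedata

# Simplified, category-based classes (an assumption for illustration only).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def identifier_class(ch):
    """Classify one character by its general category alone."""
    gc = unicodedata.category(ch)
    if gc in START_CATEGORIES:
        return "identifier_start"
    if gc in EXTEND_CATEGORIES:
        return "identifier_extend"
    return "other"


for ch in ("\u0c46", "\u0c48", "\u0eb3"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {identifier_class(ch)}")
```

    Under this assumption the Telugu signs fall under extend because they
    are gc=Mn, while U+0EB3 falls under start because it is gc=Lo, so only
    the latter would need to be listed as an exception.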

    I must have missed something about the gc=Mn Telugu vowel signs (and the
    four gc=Mc ones: U, UU, vocalic R and RR), and why they are given general
    category Mn instead of Mc as in other Indic scripts. (The compatibility
    interactions with ISCII seem quite strange for Telugu.)
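
    The Mn/Mc split among the Telugu vowel signs can be listed directly. A
    minimal sketch, assuming the runtime's unicodedata tables (which may be
    newer than 4.0.1 and may include vowel signs added later);
    telugu_vowel_signs() is a made-up helper name.

```python
import unicodedata


def telugu_vowel_signs():
    """Return (code point, name, general category) for each character in
    the Telugu block whose name starts with TELUGU VOWEL SIGN."""
    rows = []
    for cp in range(0x0C00, 0x0C80):  # Telugu block
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # empty string for unassigned
        if name.startswith("TELUGU VOWEL SIGN"):
            rows.append((cp, name, unicodedata.category(ch)))
    return rows


for cp, name, gc in telugu_vowel_signs():
    print(f"U+{cp:04X} {gc} {name}")
```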

    Well, I can use it the way it is defined, but I fear that the unique
    behavior of this recomposable character may cause some problems.

    This archive was generated by hypermail 2.1.5 : Sat Nov 22 2003 - 22:28:50 EST