Re: Collating nonconjunct and conjunct forms of words

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue May 10 2005 - 12:42:43 CDT


    By default in UCA, the ZWJ and ZWNJ are completely ignorable. So any two
    strings that only differ by those characters will sort next to one another.
    (The one exception to complete ignorability is that they can block
    contractions.)

    200C ; [.0000.0000.0000.0000] # [200C] ZERO WIDTH NON-JOINER
    200D ; [.0000.0000.0000.0000] # [200D] ZERO WIDTH JOINER

    See http://www.unicode.org/Public/UCA/latest/allkeys.txt
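
    As a rough illustration of what "completely ignorable" means, here is a
    toy Python sketch. It is not the real UCA: the helper names
    (toy_elements, toy_sort_key) are made up, and every other character
    simply gets its code point as a stand-in primary weight. The only point
    is that all-zero weights contribute nothing to the sort key at any
    level. Contraction blocking is not modelled.

```python
# Toy model only: collation elements are (primary, secondary, tertiary,
# quaternary) tuples, mirroring the four fields in the allkeys.txt lines.
ZWNJ = "\u200C"
ZWJ = "\u200D"


def toy_elements(s):
    """Map each character to a made-up collation element."""
    elements = []
    for ch in s:
        if ch in (ZWNJ, ZWJ):
            # Completely ignorable: zero at every level.
            elements.append((0, 0, 0, 0))
        else:
            # Assumption: code point used as a stand-in primary weight.
            elements.append((ord(ch), 0x20, 0x02, ord(ch)))
    return elements


def toy_sort_key(s):
    """Build a sort key level by level, dropping zero weights."""
    ces = toy_elements(s)
    key = []
    for level in range(4):
        key.extend(ce[level] for ce in ces if ce[level] != 0)
        key.append(0)  # level separator
    return tuple(key)


joined = "\u0915\u094D\u0937"          # KA + VIRAMA + SSA
non_joined = "\u0915\u094D\u200C\u0937"  # KA + VIRAMA + ZWNJ + SSA

words = ["\u0917", non_joined, "\u0915\u0937", joined]
ordered = sorted(words, key=toy_sort_key)
```

    In this sketch the two spellings get identical keys, so they end up
    adjacent in the sorted list no matter what else is in it.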

    The viramas (halants), on the other hand, are given primary weights,
    typically at the very end of each script, after the vowels. For example:

    094D ; [.1853.0020.0002.094D] # DEVANAGARI SIGN VIRAMA

    That means the ordering is as follows, where C1..Cn are consonants
    (with the default vowel), V1..Vm are vowels, and X is the virama/halant:

    C1
    C1 C1
    ...
    C1 Cn
    C1 V1
    ...
    C1 Vm
    C1 X C1
    C1 X C1 V1
    ...
    C1 X C1 Vm
    ...
    C1 X Cn
    C1 X Cn V1
    ...
    C1 X Cn Vm
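
    The same ordering can be reproduced with a small Python sketch that
    compares primary weights only. The weights below are hypothetical and
    are chosen just so that consonants < vowels < virama, as described
    above; the names PRIMARY and primary_key are for illustration only.

```python
# Hypothetical primary weights: consonants first, then vowels, then the
# virama (X) at the very end.
PRIMARY = {"C1": 1, "C2": 2, "V1": 3, "V2": 4, "X": 5}


def primary_key(word):
    """Primary-level sort key: the sequence of primary weights."""
    return tuple(PRIMARY[token] for token in word.split())


words = [
    "C1 X C2 V1", "C1 V2", "C1 X C1", "C1", "C1 C2",
    "C1 X C1 V2", "C1 V1", "C1 X C2", "C1 C1", "C1 X C1 V1",
]

ordered = sorted(words, key=primary_key)
for w in ordered:
    print(w)
```

    Because a shorter key that is a prefix of a longer one sorts first,
    "C1" comes before everything else, and every string containing X sorts
    after all the plain consonant and vowel sequences.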

    This ordering can be tailored on a per-language basis in CLDR; see, for
    example, the Hindi tailoring at
    http://unicode.org/cldr/data/common/collation/hi.xml

    Mark

    ----- Original Message -----
    From: "N. Ganesan" <naa.ganesan@gmail.com>
    To: "Unicode List" <unicode@unicode.org>
    Sent: Tuesday, May 10, 2005 08:10
    Subject: Collating nonconjunct and conjunct forms of words

    > In Indian languages, ZWJ or ZWNJ are used
    > to produce conjunct and nonconjunct forms
    > of identical words.
    >
    > Interestingly, identical words appear in
    > conjunct form in some places in a book while
    > nonconjunct forms of the same words appear elsewhere
    > in that same book. In Tamil, this situation
    > exists for Sanskrit loan words.
    >
    > Also, in the first half of the 20th century,
    > Islamic names were written with a conjunct ksha;
    > they are now universally written with a nonconjunct ksha.
    >
    > Linguistically, it makes sense to place
    > identical words next to each other when sorting
    > a book's words, if that book has the same word
    > with both conjunct and nonconjunct letters in different places.
    >
    > How does Unicode treat collation of conjunct
    > and nonconjunct forms of identical words?
    > Are they next to each other? Since North Indian
    > languages possibly have this situation many times,
    > is there any general rule or policy?
    >
    > N. Ganesan