FW: ZWNJ & Persian Collation

From: Magda Danish (Unicode) (v-magdad@microsoft.com)
Date: Tue Mar 11 2003 - 13:28:28 EST

    Please make sure to copy Vladimir (iranorus@online.ru) on your reply.
    Thanks,
    Magda

    > -----Original Message-----
    > From: Vladimir Ivanov [mailto:iranorus@online.ru]
    > Sent: Tuesday, March 11, 2003 6:22 AM
    > To: Magda Danish (Unicode)
    > Subject: ZWNJ & Persian Collation
    >
    >
    > Dear Magda,
    >
    > Excuse me for bothering you again, but my message was rejected by
    > some server on its way to unicode@unicode.org. May I ask you to
    > publish my question below? Thank you, Vladimir.
    >
    >
    >
    > While sorting Persian words with a utility based on a tailored
    > version 3.1.1 Allkeys table
    > (http://www.unicode.org/reports/tr10/#AllKeys), I have encountered
    > a problem that affects the lexicographical order of the words in a
    > dictionary.
    >
    > As I understand it, ZWNJ (zero width non-joiner, U+200C; also found
    > among the MS Word Special Characters as No-Width Optional Break)
    > was invented to prevent Arabic letters from joining within a word.
    >
    > It is used in Persian to show the morphemic boundary in compound
    > words like خانه‌داری xānedāri ‘household’. This word consists of
    > the word خانه xāne ‘house’ + the verb stem دار dār ‘hold’ + the
    > suffix ی ‘i’. It can be transliterated as xāne + ZWNJ + dāri.
    > There are thousands of words with a similar structure in Persian,
    > Dari, Tajik and neighboring languages.
    >
    > Clearly, ZWNJ has letters on both sides and lies within the word
    > boundaries. Placing ZWNJ at the edge of a word doesn’t make sense
    > in Persian. From this point of view, ZWNJ should be treated as a
    > special character rather than as a delimiter.
    >
    > But in the Allkeys table it is placed on line #68, well before
    > other common delimiters:
    >
    >   HORIZONTAL TABULATION  line #192
    >   LINE FEED              line #193
    >   CARRIAGE RETURN        line #196
    >   SPACE                  line #197, etc.
    >
    > Such an ordering gives wrong sorting results for Persian
    > dictionaries: compound words like خانه‌داری xānedāri ‘household’
    > appear in the list before their components like خانه xāne ‘house’.
    >
    > I’ve solved this problem for myself by placing ZWNJ somewhere after
    > the delimiters, but what are the theoretical reasons for putting it
    > before them? What is gained by this, and for which languages?
    >
    > Is it a Persian-specific problem or a global one? Are there
    > languages where ZWNJ marks a word boundary?
    >
    > By the way, the sorting algorithm built into MS Windows puts
    > compound words with ZWNJ AFTER their simple components. So in this
    > respect it follows principles different from those of the Allkeys
    > table.
    >
    >
    >
    > Thank you,
    >
    > Vladimir Ivanov
    >
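
The structure described in the message (xāne + ZWNJ + dāri) can be sketched in Python as below. This is only an illustration: the code points are an assumption about the spelling (in particular U+06CC for the final ی), and the variable names are not from the original message.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Assumed code points for the three components named in the message.
xane = "\u062e\u0627\u0646\u0647"  # خانه  'house'
dar = "\u062f\u0627\u0631"         # دار   verb stem 'hold'
suffix_i = "\u06cc"                # ی     suffix 'i' (assumed FARSI YEH)

# The compound keeps ZWNJ at the morpheme boundary, inside the word.
compound = xane + ZWNJ + dar + suffix_i

zwnj_index = compound.index(ZWNJ)
parts = compound.split(ZWNJ)

print(len(compound))  # 9 code points: 4 + 1 + 3 + 1
print(zwnj_index)     # 4: letters on both sides, never at an edge
print(parts == [xane, dar + suffix_i])  # True
```

Because ZWNJ sits strictly between letters here, splitting on it recovers the base word and the remainder, which is the sense in which the message calls it a morpheme-boundary marker rather than a delimiter.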
    >
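
The effect of placing ZWNJ before or after the delimiters can be illustrated with a toy, single-level sort key. This is not the Unicode Collation Algorithm, and it does not reproduce the Allkeys table: the weights are invented, `make_key` is a hypothetical helper, the ASCII strings are stand-ins for the Persian spellings, and "xane dari" with a space is a hypothetical variant spelling used only for comparison.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def make_key(zwnj_weight, space_weight=2):
    """Build a sort-key function with invented weights (illustration only)."""
    def key(word):
        weights = []
        for ch in word:
            if ch == ZWNJ:
                weights.append(zwnj_weight)
            elif ch == " ":
                weights.append(space_weight)
            else:
                # Letters always sort after both ZWNJ and SPACE in this toy.
                weights.append(100 + ord(ch))
        return weights
    return key


words = ["xane dari", "xane" + ZWNJ + "dari", "xane"]

# ZWNJ weighted below SPACE (ZWNJ listed before the delimiters).
zwnj_first = sorted(words, key=make_key(zwnj_weight=1))

# ZWNJ weighted above SPACE (ZWNJ moved after the delimiters).
zwnj_last = sorted(words, key=make_key(zwnj_weight=3))

print([w.replace(ZWNJ, "<ZWNJ>") for w in zwnj_first])
# ['xane', 'xane<ZWNJ>dari', 'xane dari']
print([w.replace(ZWNJ, "<ZWNJ>") for w in zwnj_last])
# ['xane', 'xane dari', 'xane<ZWNJ>dari']
```

In this simplified model the bare word always comes first because it is a prefix of the others; only the relative order of the ZWNJ form and the space-separated form changes with the weight chosen for ZWNJ. A real multi-level collation may behave differently.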



    This archive was generated by hypermail 2.1.5 : Tue Mar 11 2003 - 15:34:18 EST