On Thu, Nov 30, 2000 at 03:44:00AM -0800, Branislav Tichy wrote:
> this subject (or one like it) has probably been discussed already, but let me
> ask one more question about it: sequences vs. collating
> i have recently read the page //www.unicode.org/unicode/standard/where/
> and i basically agree with listed reasons (for not including all possible
> sequences...) except one. let me explain it on Slovak.
> there actually is an example of one possible Slovak sequence (may i call it
> a digraph?): 'ch' or 0063 0068. other possibilities are 'dz', 'd3' (d + z
> caron: 0064 017e | 0064 007a 030c), 'ia', 'ie', 'iu', 'ou'. the problem is,
> a) they _are not_ sorted as c+h, d+z... when standing for one grapheme
> (the order is ...d,dz,d3,e...h,ch,i,...)
> b) there are compound words which have these sequences on a word border,
> and in this case they stand for two separate graphemes and _are_ sorted
> as c+h, d+z and so on.
> a proper collation algorithm would therefore have to recognise (imho)
> whether there is one grapheme or two (whether the word is a compound)!
> one possible solution could be using codes
> 200b zwsp
> 200c zwnj
> 200d zwj
> to distinguish digraphs (like in the example with fi ligature).
> or maybe there could be some 'digraph gluing' code?
> or maybe code for word border in compound words?
> or it could be handled by 009a (single character introducer) code?
> this way sorting could be done by a low-level algorithm without any need
> for word dictionaries (i can't think of any other means of distinguishing
> a compound word and its parts properly)
I would suggest you use something like SHY (soft hyphen, U+00AD) between
the parts of a compound word. That way you also have
an indication of where to hyphenate.
Sorting is well understood for Slovak and special rules have been in
place for these digraphs for a long time.
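
To illustrate the idea, here is a minimal sketch in Python, not a
conforming collation implementation. It assumes a simplified Slovak
alphabet in which 'ch', 'dz' and 'dž' are single collation elements, and
it treats a soft hyphen as a boundary that keeps the letters on either
side from being read as a digraph. The names ALPHABET, DIGRAPHS and
sort_key are made up for this example, and accented letters are ranked
as separate primary letters purely to keep the table short.

```python
import unicodedata

SHY = "\u00ad"  # SOFT HYPHEN, used here as a compound-word boundary

# Simplified alphabet (an assumption for this sketch): each digraph is one
# collation element placed right after its base letter(s).
ALPHABET = [
    "a", "á", "ä", "b", "c", "č", "d", "ď", "dz", "dž", "e", "é", "f", "g",
    "h", "ch", "i", "í", "j", "k", "l", "ĺ", "ľ", "m", "n", "ň", "o", "ó",
    "ô", "p", "q", "r", "ŕ", "s", "š", "t", "ť", "u", "ú", "v", "w", "x",
    "y", "ý", "z", "ž",
]
RANK = {element: i + 1 for i, element in enumerate(ALPHABET)}
DIGRAPHS = {"ch", "dz", "dž"}


def sort_key(word):
    """Return a tuple of ranks, one per collation element of `word`."""
    # NFC turns z + U+030C into U+017E, so both spellings of z-caron match.
    text = unicodedata.normalize("NFC", word).lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == SHY:
            # The soft hyphen carries no weight of its own; it only prevents
            # the letters around it from pairing up as a digraph.
            i += 1
            continue
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the table get rank 0 and sort first.
            key.append(RANK.get(text[i], 0))
            i += 1
    return tuple(key)


words = ["chata", "idea", "hora", "cena"]
print(sorted(words, key=sort_key))
```

With this key, a word beginning with the digraph 'ch' lands between the
h-words and the i-words, while a string such as "viac" + SHY + "hlasný"
is keyed as c followed by h and therefore sorts apart from the same
letters written without the soft hyphen.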