sequences and stuff

From: Branislav Tichy (tichy6@kepler.fmph.uniba.sk)
Date: Thu Nov 30 2000 - 07:03:06 EST


hello,

this subject (or something like it) has probably been discussed already, but
let me ask one more question about it: sequences vs. collation.
i have recently read the page //www.unicode.org/unicode/standard/where/
and i basically agree with the listed reasons (for not including all possible
sequences...) except one. let me explain it using Slovak.
there actually is an example of one possible Slovak sequence (may i call it
a digraph?): 'ch' or 0063 0068. other possibilities are 'dz', 'd3' (d+z
caron, 0064 017e | 0064 007a 030c), 'ia', 'ie', 'iu', 'ou'. the problem is
that
a) they _are not_ sorted as c+h, d+z... when standing for one grapheme
(the order is ...d,dz,d3,e...h,ch,i,...)
b) there are compound words which have these sequences on a word border,
and in this case they stand for two separate graphemes and _are_ sorted
as c+h, d+z and so forth.
a proper collation algorithm would therefore have to determine (imho)
whether there is one grapheme or two (whether the word is a compound)!
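
to make points a) and b) concrete, here is a minimal Python sketch, not a
real collation implementation. it assumes a simplified alphabet (plain a-z
plus 'ch', 'dz' and d+z caron, in the order given above) and leaves out the
other Slovak letters. the names ALPHABET, RANK, DIGRAPHS and naive_key are
made up for this illustration, and 'viachlasný' is used on the assumption
that it is a compound of viac + hlasný.

```python
import unicodedata

# Simplified alphabet (an assumption for this sketch): plain a-z plus the
# three digraphs, placed as in the post (d, dz, dž, e ... h, ch, i).
ALPHABET = ["a", "b", "c", "d", "dz", "d\u017e", "e", "f", "g", "h", "ch",
            "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("ch", "dz", "d\u017e")


def naive_key(word):
    """Sort key that greedily treats every digraph as a single letter."""
    # NFC makes 0064 007a 030c and 0064 017e compare the same.
    word = unicodedata.normalize("NFC", word.lower())
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the simplified alphabet sort after it.
            key.append(RANK.get(word[i], len(ALPHABET) + ord(word[i])))
            i += 1
    return key


# Point a): 'ch' sorts after 'h', not between 'c' and 'd'.
print(sorted(["idea", "chata", "hora", "cesta"], key=naive_key))
# -> ['cesta', 'hora', 'chata', 'idea']

# Point b): the greedy key also glues c+h across the compound border,
# which is exactly the case it cannot detect without a dictionary.
print(naive_key("viachlasn\u00fd")[3] == RANK["ch"])
# -> True
```

the second print shows the failure: the key has no way of knowing that the
c and h in that word belong to different parts of the compound.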

suggestion:
one possible solution could be to use the codes
        200b zwsp
or
        200c zwnj
        200d zwj
to distinguish digraphs (as in the example with the fi ligature).

or maybe there could be some 'digraph gluing' code?
or maybe a code for the word border in compound words?
or could it be handled by the 009a (single character introducer) code?

this way sorting could be done by a low-level algorithm without any need
for word dictionaries (i can't think of any other means of distinguishing
a compound word and its parts properly)
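
a rough Python sketch of how the 200c (zwnj) variant of the suggestion
might work is given here. it only illustrates the idea and does not
describe the behaviour of any existing collation library. the alphabet is
again simplified (a-z plus three digraphs), the name marked_key is
hypothetical, and the example words assume that 'viachlasný' is a compound
of viac + hlasný.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, used here as the border marker

# Simplified alphabet (an assumption): a-z plus ch, dz, dž in Slovak order.
ALPHABET = ["a", "b", "c", "d", "dz", "d\u017e", "e", "f", "g", "h", "ch",
            "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("ch", "dz", "d\u017e")


def marked_key(word):
    """Sort key where a ZWNJ between two letters blocks digraph matching."""
    word = unicodedata.normalize("NFC", word.lower())
    key, i = [], 0
    while i < len(word):
        if word[i] == ZWNJ:
            i += 1  # the marker itself carries no sort weight
            continue
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])  # adjacent letters: one grapheme
            i += 2
        else:
            # Single letter; unknown characters sort after the alphabet.
            key.append(RANK.get(word[i], len(ALPHABET) + ord(word[i])))
            i += 1
    return key


unmarked = ["viachlasn\u00fd", "viacn\u00e1sobn\u00fd"]
marked = ["viac\u200chlasn\u00fd", "viacn\u00e1sobn\u00fd"]

# Without the marker c+h is taken as 'ch', which sorts after plain 'c'.
print(sorted(unmarked, key=marked_key) == unmarked[::-1])  # -> True
# With the marker the word sorts as v-i-a-c-h..., ahead of v-i-a-c-n...
print(sorted(marked, key=marked_key) == marked)            # -> True
```

the key needs no word list: the information about the compound border
travels with the text itself, at the cost of the writer having to insert
the marker.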

brano

(0062 0072 0061 0148 006f)


