From: spir (email@example.com)
Date: Wed Oct 13 2010 - 06:02:46 CDT
I would like to clarify a point which, I guess, relates more to the UCS character set than to Unicode properly speaking.
Say I have a sequence of codes, each representing a UCS "abstract character", the sequence as a whole being the representation of a text. The text is the French word "âme" (soul). Then the sequence may be either (61 302 6D 65) or (E2 6D 65), the latter using a precomposed 'â'. Both are valid and both represent the same text. Correct?
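To make the two encodings concrete, here is a minimal sketch in Python 3 using the standard `unicodedata` module. The helper name `code_points` is only illustrative. It shows that the two strings differ code by code, yet become identical once both are brought to the same normalization form (NFC composes, NFD decomposes).

```python
import unicodedata

precomposed = "\u00E2me"   # U+00E2 U+006D U+0065
decomposed = "a\u0302me"   # U+0061 U+0302 U+006D U+0065

def code_points(text):
    """Return the code points of text as 4-digit hex strings (illustrative helper)."""
    return [format(ord(ch), "04X") for ch in text]

print(code_points(precomposed))
print(code_points(decomposed))

# A plain comparison works on codes, so the two forms are unequal.
print(precomposed == decomposed)

# After normalization to a common form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```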
How is a function supposed to return the first "true character" in the common sense, meaning what is sometimes called a "grapheme" in Unicode literature? From the first sequence, it would wrongly return 'a'. Without a rather sophisticated analysis of UCS code combinations (itself requiring knowledge about the world's writing systems), there is no chance for such a routine to automagically combine codes. Correct?
Returning 'a' from the first sequence makes no sense, since it is only a part of a composite character (letter). [Actually, it may make sense, e.g. for a linguistic app counting occurrences of 'a'-based characters -- but this is a rather specific need.] From the second sequence, the function would return 'â', which is correct; but only by chance, so to say, just because in this case the code happens to represent a whole character (grapheme). What can we do with a function returning isolated codes that, in the general case, represent only parts of whole characters (what I call "marks")? Since one cannot guess whether a code taken alone represented part or the whole of a grapheme in the original text, in my view, nothing. Am I right on this?
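A rough sketch of a "first grapheme" function in Python 3 follows. It is a simplification, not a full implementation: it only attaches code points whose canonical combining class is non-zero to the preceding base, which is enough for the 'a' + circumflex case here. Proper grapheme cluster boundaries are defined in Unicode Standard Annex #29 and cover more cases than this does. The name `first_grapheme` is hypothetical.

```python
import unicodedata

def first_grapheme(text):
    """Return the first code point plus any combining marks that follow it.

    Simplified assumption: a "mark" is any code point with a non-zero
    canonical combining class. This is not full UAX #29 segmentation.
    """
    if not text:
        return ""
    end = 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]

print(first_grapheme("\u00E2me") == "\u00E2")     # precomposed: one code
print(first_grapheme("a\u0302me") == "a\u0302")   # decomposed: base + mark
```

Note that the two results are still different code sequences; comparing them requires normalizing both to the same form.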
Now, say a function returns the (first) position of a given grapheme, else fails. If I use it to search for 'a' and the 'â' is decomposed, it will wrongly find 'a' as the first character. Correct?
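One way to avoid that false match, sketched in Python 3 under the same simplifying assumption (a mark is any code point with a non-zero canonical combining class), is to split the text into base-plus-marks clusters first and compare whole clusters only. The names `graphemes` and `find_grapheme` are hypothetical, and the returned position counts clusters, not codes.

```python
import unicodedata

def graphemes(text):
    """Split text into clusters of one base code point plus following marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def find_grapheme(text, target):
    """Return the index of the first cluster equal to target, or -1."""
    for index, cluster in enumerate(graphemes(text)):
        if cluster == target:
            return index
    return -1

decomposed = "a\u0302me"
print(decomposed.find("a"))              # code-level search matches the bare 'a'
print(find_grapheme(decomposed, "a"))    # cluster-level search does not
```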
If I use it to search for 'â', a rather interesting situation emerges, I guess. If 'â' happens to be expressed in the same form (decomposed or precomposed in both cases) in the input text and in the source code, then the function finds it; else it fails. Strange, no? What do you think? It means, if I am right, that to run such a routine we need to produce canonical forms of both the input and the parameter. Then we can safely compare expressions, so to say, in the same language. Correct?
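A minimal sketch of that idea in Python 3, using `unicodedata.normalize` from the standard library: both the text and the search parameter are normalized to the same form before searching. The function name `normalized_find` is hypothetical, and the returned position refers to the normalized text, which may differ from positions in the original.

```python
import unicodedata

def normalized_find(text, pattern, form="NFC"):
    """Search after bringing text and pattern to the same normalization form."""
    norm_text = unicodedata.normalize(form, text)
    norm_pattern = unicodedata.normalize(form, pattern)
    return norm_text.find(norm_pattern)

precomposed = "\u00E2me"
decomposed = "a\u0302me"

# All four combinations of text form and pattern form now succeed.
for text in (precomposed, decomposed):
    for pattern in ("\u00E2", "a\u0302"):
        print(normalized_find(text, pattern))
```

With the composed form (NFC), a search for a bare 'a' no longer matches inside 'â'. With the decomposed form (NFD) it still would, so normalization alone does not settle the question of partial matches.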
If you use Python, see below a little script illustrating the issues.
Any comment, criticism, or pointer is welcome, thank you,
t1 = u"\u00E2me"         # "âme", using precomposed character form for 'â'
t2 = u"\u0061\u0302me"   # "âme", using decomposed character form for 'â'
print("%s %s\t%s %s\t%s %s\t%s %s\t%s %s\t%s %s\t%s %s" % (
    t1 , t2 ,                                               # the two texts
    len(t1) , len(t2) ,                                     # number of codes
    t1[0] , t2[0] ,                                         # first code
    t1.find(u'a') , t2.find(u'a') ,
    t1.find(u'â') , t2.find(u'â') ,                         # literal is precomposed here
    t1.find(u"\u00E2") , t2.find(u"\u00E2") ,
    t1.find(u"\u0061\u0302") , t2.find(u"\u0061\u0302") ,
))
# --> âme âme 3 4 â a -1 0 0 -1 0 -1 -1 0
-- -- -- -- -- -- --
vit esse estrany ☣