From: Phillips, Addison (firstname.lastname@example.org)
Date: Wed Oct 13 2010 - 10:32:18 CDT
UAX #29 (Unicode Text Segmentation) discusses this at length. See especially the section on grapheme cluster boundaries:
Certainly a function that returns the first code point of a string is different from one that finds the first grapheme cluster boundary. Sometimes you need one function and sometimes the other, and it is important to know the difference.
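To make the difference concrete, here is a minimal Python 3 sketch. The helper names `first_code_point` and `first_cluster_approx` are hypothetical, and the second one is only a rough approximation: it extends the base character over any following code points with a non-zero canonical combining class. That covers the 'a' + U+0302 case from your message, but it is not the full UAX #29 algorithm (it ignores Hangul jamo, ZWJ sequences, regional indicators, and so on).

```python
import unicodedata


def first_code_point(text):
    """Return the first code point of text (an empty string if text is empty)."""
    return text[:1]


def first_cluster_approx(text):
    """Return the first character plus any combining marks that follow it.

    Assumption: a simplified stand-in for UAX #29 grapheme cluster
    segmentation, adequate only for base letter + combining mark sequences.
    """
    if not text:
        return ""
    end = 1
    # unicodedata.combining() returns the canonical combining class:
    # 0 for base characters, non-zero for marks such as U+0302.
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]


precomposed = "\u00E2me"       # U+00E2 m e
decomposed = "\u0061\u0302me"  # a U+0302 m e

print(ascii(first_code_point(decomposed)))       # 'a'
print(ascii(first_cluster_approx(decomposed)))   # 'a\u0302'
print(ascii(first_cluster_approx(precomposed)))  # '\xe2'
```

For production use, a library that implements the UAX #29 rules in full is the safer choice; the loop above is only meant to show why the two functions give different answers.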
Your example code also illustrates the problem of Unicode normalization. Sometimes you may wish to include normalization as part of text processing to deal with cases such as this. See UAX #15 for more details.
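As a sketch of what that might look like in Python 3, `unicodedata.normalize` from the standard library can bring both the text and the search key to the same form before comparing. The helper name `find_normalized` is hypothetical, and the choice of NFC as the default is an assumption (NFD works equally well as long as both sides use it). Note that the index returned refers to the normalized text, not the original.

```python
import unicodedata


def find_normalized(text, key, form="NFC"):
    """Find key in text after normalizing both to the same form.

    The returned index is an offset into the normalized text.
    """
    norm_text = unicodedata.normalize(form, text)
    norm_key = unicodedata.normalize(form, key)
    return norm_text.find(norm_key)


t1 = "\u00E2me"        # precomposed
t2 = "\u0061\u0302me"  # decomposed

print(t1 == t2)  # False
print(unicodedata.normalize("NFC", t1) == unicodedata.normalize("NFC", t2))  # True
print(find_normalized(t1, "\u0061\u0302"), find_normalized(t2, "\u00E2"))  # 0 0
```

Normalization alone does not stop a search for a bare 'a' from matching the start of a decomposed 'â': `find_normalized(t2, "a", "NFD")` still returns 0. Avoiding that requires matching on grapheme cluster boundaries as well.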
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)
Internationalization is not a feature.
It is an architecture.
> -----Original Message-----
> From: email@example.com [mailto:firstname.lastname@example.org]
> On Behalf Of spir
> Sent: Wednesday, October 13, 2010 4:03 AM
> To: unicode
> Subject: composite graphemes
> I would like to clarify a point which, I guess, relates more to
> the UCS character set than to Unicode properly speaking.
> Say I have a sequence of codes, each representing a UCS "abstract
> character", itself being the representation of a text. The text is
> the French word "âme" (soul). Then, the sequence may be either (61
> 302 6D 65) or (E2 6D 65), the latter using a precomposed 'â'. Both
> are valid and both represent the same text. Correct?
> How is a function supposed to return the first "true character", in
> the common sense, meaning what is sometimes called "grapheme" in
> Unicode literature? From the first sequence, it would indeed
> wrongly return 'a'. Without a rather sophisticated analysis of UCS
> code combinations (itself requiring knowledge about world scripting
> systems), there is no chance for such a routine to automagically
> combine codes. Correct?
> Returning 'a' from the first sequence makes no sense, since it is
> only a part of a composite character (letter). [Actually, it may
> make sense, e.g., for a linguistic app counting occurrences of 'a'-based
> characters -- but this is a rather specific need.] From the second
> sequence, the function would indeed return 'â', which is correct;
> but only by chance, so to say, just because in this case the code
> happens to represent a whole character (grapheme). What can we do
> with a function returning isolated codes that, in the general case,
> represent parts of whole characters (what I call "marks")? Since
> one cannot guess whether a code alone represented part or the whole
> of a grapheme in the original text, in my view, nothing. Am I right
> on this?
> Now, say a function returns the (first) position of a given
> grapheme, else fails. Using it to search 'a', then, if the 'â' is
> decomposed, it will wrongly find 'a' as first character. Correct?
> If I use it to search 'â', a rather interesting situation emerges,
> I guess. If 'â' happens to be expressed in the same form (de- or
> pre- composed in both cases) in input text and in source code, then
> the function finds it; else it fails. Strange, no? What do you
> think? It means, if I am right, that to run such a routine we need
> to produce canonical forms of both the input and the parameter.
> Then we can safely compare expressions, so to say, in
> the same language. Correct?
> If you use Python, see below a little script illustrating the issues.
> Any comment, criticism, or pointer is welcome, thank you,
> t1 = u"\u00E2me"        # "âme", using the precomposed form of 'â'
> t2 = u"\u0061\u0302me"  # "âme", using the decomposed form of 'â'
> # text, length, first code point, then the results of four searches
> print("%s %s\t%s %s\t%s %s\t%s %s\t%s %s\t%s %s\t%s %s" % (
>     t1, t2,
>     len(t1), len(t2),
>     t1[0], t2[0],
>     t1.find(u'a'), t2.find(u'a'),
>     t1.find(u'â'), t2.find(u'â'),
>     t1.find(u"\u00E2"), t2.find(u"\u00E2"),
>     t1.find(u"\u0061\u0302"), t2.find(u"\u0061\u0302"),
> ))
> # --> âme âme 3 4 â a -1 0 0 -1 0 -1 -1 0
> -- -- -- -- -- -- --
> vit esse estrany ☣