The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: Interrupting grapheme clusters
Posted: Sun Apr 29, 2012 5:01 pm

Joined: Sun Apr 29, 2012 1:52 pm
Posts: 5
Is there a character for interrupting grapheme clusters? I am programming a string interface where an index refers to a grapheme cluster rather than a code point. When concatenating strings, it would be preferable to preserve length, so that the length of the first string plus the length of the second equals the length of the new string; otherwise it could lead to bugs that are hard to detect. This cannot be guaranteed when the end of the first string and the beginning of the second combine to form a single cluster.
A solution would be to insert a cluster-interrupt character that does not itself count as a grapheme, if one existed. ZWSP is not an option because it affects word boundaries. CGJ, ZWNJ and the like have the Extend property, so they attach to the preceding cluster rather than interrupt it.
Besides, would it not be preferable to have the option to display combining marks separately, if so desired? Is there no such character?
If no such character exists, what is the justification for providing the analogous characters for digraphs and words, but not for graphemes?
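
A minimal Python sketch of the length problem described above. It assumes a deliberately simplified segmentation rule (any code point with General_Category M* extends the previous cluster), which is not the full UAX #29 algorithm; `cluster_count` is a hypothetical helper, not part of any library.

```python
import unicodedata

def cluster_count(s: str) -> int:
    """Count clusters under a simplified rule: a mark (category M*) extends
    the previous cluster; anything else starts a new one. A leading mark
    forms a degenerate cluster of its own."""
    count = 0
    for i, ch in enumerate(s):
        is_mark = unicodedata.category(ch).startswith("M")
        if i == 0 or not is_mark:
            count += 1
    return count

first = "e"          # one cluster
second = "\u0301x"   # COMBINING ACUTE ACCENT + "x": two clusters in isolation
joined = first + second

# The accent attaches to "e", so the lengths are not additive.
print(cluster_count(first), cluster_count(second), cluster_count(joined))
# prints: 1 2 2
```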


Last edited by emimull on Mon Apr 30, 2012 3:51 am, edited 1 time in total.

 Post subject: Re: Interrupting grapheme clusters
Posted: Sun Apr 29, 2012 6:38 pm

Joined: Mon Feb 01, 2010 6:18 pm
Posts: 79
Combining marks are displayed independently by using ZWSP as the base character. Is there any reason why a Private Use character won't serve your grapheme boundary purpose? After all, the entire purpose of the Private Use Areas is to enable people to represent scripts, characters, and functions that are not covered by named characters.


 Post subject: Re: Interrupting grapheme clusters
Posted: Sun Apr 29, 2012 8:30 pm

Joined: Sun Apr 29, 2012 1:52 pm
Posts: 5
The resulting string must be portable, so that is a problem :)


 Post subject: Re: Interrupting grapheme clusters
Posted: Sun Apr 29, 2012 8:47 pm

Joined: Mon Feb 01, 2010 6:18 pm
Posts: 79
The CGJ is really the only thing that seems to match what you are looking for. Please make sure to actually read up on what it is and isn't. It doesn't actually join graphemes. In fact, it separates two characters that would normally be considered a digraph.


 Post subject: Re: Interrupting grapheme clusters
Posted: Sun Apr 29, 2012 9:05 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
OK. Let's say I buy the requirement that you have a string interface where each string must consist of an integral number of well-formed grapheme clusters.

If that's the case, concatenating two strings would not violate that condition.

In order to make your interface predictable, the restriction needs to be the same whether or not the second string is ever concatenated to the first. That means your definition of a well-formed grapheme cluster must be such that the starting (or base) characters you allow are distinct from any of the continuing or extending characters.

If a restriction to such grapheme clusters is possible, you can then treat any string that starts with an extending character as malformed. With malformed strings, it makes a lot less sense to guarantee things like "word boundaries" so using NBSP on concatenation would be perfectly fine. (And way preferable to novel uses of CGJ).

In fact, your string interface should automatically "prepend" such a NBSP whenever someone presents it with a string that starts with a malformed (or degenerate) grapheme cluster.

The upside of that approach is that any string with the same human-readable contents has the same length. With your approach, you would get two strings that look the same but are different on an invisible level. That sounds like a recipe for subtle bugs.

The downside of enforcing well-formed grapheme clusters is that you cannot edit any grapheme cluster except as a unit, and during typing partial units cannot be part of any string.
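
The NBSP approach suggested above could be sketched roughly as follows in Python. This is only an illustration under a simplifying assumption: "extending character" is approximated as any code point with General_Category M*, not the full UAX #29 rules. The names `well_formed` and `cluster_count` are hypothetical.

```python
import unicodedata

NBSP = "\u00a0"

def is_mark(ch: str) -> bool:
    # Simplification: treat every M* code point as an extending character.
    return unicodedata.category(ch).startswith("M")

def well_formed(s: str) -> str:
    """Prepend NBSP when the string starts with an extending character,
    so that the leading mark gets a base of its own."""
    if s and is_mark(s[0]):
        return NBSP + s
    return s

def cluster_count(s: str) -> int:
    # A new cluster starts at position 0 and at every non-mark code point.
    return sum(1 for i, ch in enumerate(s) if i == 0 or not is_mark(ch))

a = well_formed("e")
b = well_formed("\u0301x")   # becomes NBSP + accent + "x"

# Once every string is well-formed, concatenation preserves length.
assert cluster_count(a) + cluster_count(b) == cluster_count(a + b)
```

Because the normalization happens when a string enters the interface, the result does not depend on whether the string is later concatenated to anything.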


 Post subject: Re: Interrupting grapheme clusters
Posted: Mon Apr 30, 2012 10:36 am

Joined: Sun Apr 29, 2012 1:52 pm
Posts: 5
CGJ cannot be used because it does not interrupt clusters (see UAX #29).

Also, here is an excerpt from the manual, chapter 16: "[...] As a result of these properties, the presence of a combining grapheme joiner in the midst of a combining character sequence does not interrupt the combining character sequence; any process that is accumulating and processing all the characters of a combining character sequence would include a combining grapheme joiner as part of that sequence. This differs from the behavior of most format control characters, whose presence would interrupt a combining character sequence."

After thinking about this for a bit, I have decided to go for the ZWSP, even though I was against it at first because it impacts rendering and word boundaries. Correcting the concatenation of malformed strings should be the responsibility of the user, not the interface; but the interface should try to minimize the occurrence of hard-to-track bugs that might result from it, even if that impacts rendering. I think I'd prefer ZWSP over NBSP because I feel that it impacts rendering to a lesser degree than NBSP does.
In my interface, the ZWSP will be ignored as a grapheme cluster and no such character will be assigned to an index, thus possibly minimizing some subtle bugs.
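
A rough Python sketch of this ZWSP scheme, under stated assumptions: marks are approximated as General_Category M* code points, ZWSP is treated as a cluster break (UAX #29 classes it as Control, which breaks on both sides), and `concat` and `cluster_count` are hypothetical names rather than the actual interface.

```python
import unicodedata

ZWSP = "\u200b"

def is_mark(ch: str) -> bool:
    # Simplification: treat every M* code point as an extending character.
    return unicodedata.category(ch).startswith("M")

def concat(a: str, b: str) -> str:
    """Insert ZWSP when b would otherwise extend the last cluster of a."""
    if a and b and is_mark(b[0]):
        return a + ZWSP + b
    return a + b

def cluster_count(s: str) -> int:
    """Count clusters; ZWSP ends the current cluster but is never counted
    or assigned an index itself."""
    count = 0
    prev = None
    for ch in s:
        if ch == ZWSP:
            prev = ch
            continue
        if prev is None or prev == ZWSP or not is_mark(ch):
            count += 1
        prev = ch
    return count

a, b = "e", "\u0301x"
joined = concat(a, b)   # "e" + ZWSP + accent + "x"
assert cluster_count(a) + cluster_count(b) == cluster_count(joined)
```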

Thanks vanisaac and asmus for your thoughts on this :)

P.S. Please don't consider my first post a critique. It was in fact meant as a sincere question; I did not know whether such a character existed. However, it does seem a tiny bit strange that there exist interrupts for sentences, words and digraphs, but not in the same sense for graphemes.
As a side note on the enforcement of well-formed grapheme clusters: in my interface it is possible to append code points, rather than strings, to grapheme clusters, thus enabling the user to compile sets of combining characters if so desired.


 Post subject: Re: Interrupting grapheme clusters
Posted: Mon Apr 30, 2012 1:23 pm

Joined: Mon Feb 01, 2010 6:18 pm
Posts: 79
Just make sure that you really understand what a combining character sequence is in Unicode before you reject CGJ.


 Post subject: Re: Interrupting grapheme clusters
Posted: Mon Apr 30, 2012 2:12 pm

Joined: Sun Apr 29, 2012 1:52 pm
Posts: 5
CGJ separates digraphs, which are higher entities than graphemes. CGJ is classified under 'Extend', and so to break a grapheme cluster when a CGJ is encountered would break the Grapheme Cluster Boundary Rules. I would like to conform to the Unicode extended grapheme clusters variant. I hope I'm not missing something?
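
For anyone who wants to check the properties being discussed, here is a small Python sketch. Python's standard `unicodedata` module does not expose Grapheme_Cluster_Break directly, so this uses General_Category as a proxy: nonspacing marks (Mn) are part of Grapheme_Extend and therefore fall under Extend in UAX #29. The helper `props` is a hypothetical name.

```python
import unicodedata

def props(cp: int) -> tuple[str, int]:
    """Return (General_Category, canonical combining class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)

# CGJ, ZWNJ, ZWSP and NBSP, the characters mentioned in this thread.
for cp in (0x034F, 0x200C, 0x200B, 0x00A0):
    category, ccc = props(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: category={category}, ccc={ccc}")
```

CGJ reports as Mn with combining class 0, which is consistent with it being treated as an extending character rather than a cluster boundary.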

Thanks :)


 Post subject: Re: Interrupting grapheme clusters
Posted: Mon Apr 30, 2012 3:08 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
@emimull,

Please don't consider my earlier reply to you a critique.

I just start from the observation that before one considers the lack of (or addition of) characters in (to) the standard, it's useful to consider whether a dedicated character is in fact required.

Hiding from users the effect of what (in your design) are malformed sequences is IMHO counterproductive. In fact, an argument could be made for selecting some "in your face" character so that such sequences would be painfully obvious.

But your choice is also somewhat defensible and it's your prerogative to make that call.

This leaves us, however, with no use case that would require a true "grapheme cluster segmenter" character. Hence the discussion of the need for one, including further discussion of CGJ, can be postponed until a better use case comes around.


 Post subject: Re: Interrupting grapheme clusters
Posted: Mon Apr 30, 2012 4:23 pm

Joined: Sun Apr 29, 2012 1:52 pm
Posts: 5
@asmus:

Consider the following case unrelated to my interface: A user wishes to display a letter and a combining mark separately. This can be achieved by inserting a ZWSP in between. However, this imposes side effects, both on the actual rendering and possibly internally on word boundaries. There is no way around it and the user must accept the situation as is. This is not a big deal in most cases, but IMHO it should warrant an additional character.

In relation to programming, there are already libraries that approach strings in a similar manner, so mine is not an isolated problem. On rare occasions, I believe that very subtle, hard-to-track bugs may occur, and that is a security issue. One would think that Unicode would be concerned with security.

To quote ICU: "String searching, also known as string matching, is a very important subject in the wider domain of text processing and analysis" (from a document describing string iteration involving grapheme cluster boundaries).

How you consider these examples as 'no use cases' I don't understand. If your opinion was that these are negligible ones, that would be fine.

Unless I have misquoted the manuals, the CGJ character is as unrelated to the proposed character as Latin is to Arabic. So let us postpone the discussion of it indefinitely.


 Post subject: Re: Interrupting grapheme clusters
Posted: Sat May 05, 2012 3:46 pm
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
If you want to display a combining mark in isolation, the standard way to do that is by using NBSP as a "base" character. If you need to allow breaks before the isolated character, then you can insert a ZWSP before the NBSP.
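
As a small illustration of that recommendation, a Python sketch might build the sequence like this. The function name `isolated_mark` and its `allow_break_before` parameter are hypothetical, chosen only for this example.

```python
NBSP = "\u00a0"   # NO-BREAK SPACE, used as the visible "base" for the mark
ZWSP = "\u200b"   # ZERO WIDTH SPACE, allows a line break before the NBSP

def isolated_mark(mark: str, allow_break_before: bool = False) -> str:
    """Return a sequence that displays a combining mark in isolation:
    NBSP followed by the mark, optionally preceded by ZWSP."""
    prefix = ZWSP if allow_break_before else ""
    return prefix + NBSP + mark

acute = isolated_mark("\u0301")                            # NBSP + accent
breakable = isolated_mark("\u0301", allow_break_before=True)  # ZWSP + NBSP + accent
```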

There's no use case that I know of where you need to be able to display a combining mark (in isolation) in the middle of a word, while preserving all the word breaks as if the combining mark was used in context. Those "requirements" seem to me self-contradictory.

If you have come across an actual case where the standard recommended method didn't work, then perhaps you can provide an example, with specific code points and the explanation of what combination of circumstances made it so different from the general case that is covered by the recommendation in the standard.

Your hypothetical case: "A user wishes to display a letter and a combining mark separately. This can be achieved by inserting a ZWSP in-between. However, this imposes side-effects both to the actual rendering..." I would consider self-contradictory in its requirements. If you want to display a letter and an accent separately, as you write, then they should render separately as well. Insisting that the rendering should look indistinguishable in both cases seems to make no sense and is counter to the way the standard defines combining characters.

