Re: behaviour of ZWNBSP (was Re: Unicode and Kermit)

From: peter_constable@sil.org
Date: Mon Aug 16 1999 - 16:07:08 EDT


>Indeed. But surely this would not be taken as a single word
       even in the absence of ZWNBSPs: it contains a non-alphabetic
       character.

       Let's go back to Mark's examples:

>Suppose that there is a natural word break between XY, and no
       natural word break between YZ. Then here are the word counts:

       XY: 2
       YZ: 1
       X<ZWSP>Y: 2
       Y<ZWSP>Z: 2
       X<ZWNBSP>Y: 1
       Y<ZWNBSP>Z: 1

       In the example I gave (taken from the Unicode book),
       "base+delta", the pair "e+" falls into the category Mark gave
       for XY. His conclusion was that adding ZWNBSP yielded one word.
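
       To make the table above concrete, here is a minimal sketch of
       the counting rule as I read it: a ZWSP forces a word break, a
       ZWNBSP suppresses one, and otherwise the natural break (if any)
       applies. The names count_words and NATURAL_BREAKS are
       hypothetical, and the natural-break table is just the X/Y/Z toy
       data from the example, not any real language's rules.

```python
ZWSP = "\u200b"    # ZERO WIDTH SPACE
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

# Assumption: toy data from the example. There is a natural word break
# between X and Y, and none between Y and Z.
NATURAL_BREAKS = {("X", "Y")}


def count_words(text, natural_breaks=NATURAL_BREAKS):
    """Count words under the quoted model (a sketch, not a standard algorithm)."""
    breaks = 0
    prev = None     # last ordinary (non zero-width) character seen
    pending = None  # ZWSP or ZWNBSP seen since prev, if any (last one wins)
    for ch in text:
        if ch in (ZWSP, ZWNBSP):
            pending = ch
            continue
        if prev is not None:
            if pending == ZWSP:
                breaks += 1            # explicit break opportunity
            elif pending == ZWNBSP:
                pass                   # break explicitly suppressed
            elif (prev, ch) in natural_breaks:
                breaks += 1            # natural break, nothing overrides it
        prev, pending = ch, None
    # No ordinary characters at all means no words.
    return breaks + 1 if prev is not None else 0


for sample in ["XY", "YZ", "X" + ZWSP + "Y", "Y" + ZWSP + "Z",
               "X" + ZWNBSP + "Y", "Y" + ZWNBSP + "Z"]:
    print(repr(sample), count_words(sample))
```

       Under these assumptions the six sample strings give 2, 1, 2, 2,
       1, 1, matching the counts listed above.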

>Thus this example does not settle the issue of
       whether "ap<ZWNBSP>ple" is one word or two.

       I was responding to the statement that this is (they are) two.

>In particular, why create such a thing (other than by
       accident)?
       Perhaps to suppress line-breaking with hyphenation? In that
       case "apple" remains one word, and the ZWNBSP is serving as a
       mandatory non-hyphen.

       I wouldn't normally be inclined to do so in this particular
       spot, unless, along the lines you suggest, I were writing about
       hyphenation, used "ap-ple" as an illustration, but specifically
       did not want this to break across lines to avoid confusion as
       to whether the hyphen were part of the example or an artifact
       of the layout.

       Actually, what I have in mind is hypothetical - I don't know if
       this would ever arise, and I can't think of any specific
       examples from Thai or another language that would qualify:

       In the English string "Mr. Smith", I might prefer not to have a
       line break between the words "Mr." and "Smith". Of course, we
       have NBSP for that purpose. Suppose this scenario, however: I
       have a corpus of data for a language that, like Thai, is
       written without visible spaces between all words, and I am
       using ZWSP to delimit any word boundaries not delimited by SP,
       PS, etc. I have, however, certain word pairs that, like "Mr.
       Smith", I don't want to break across a line. It seemed obvious
       to me that ZWNBSP is exactly what is needed.

       In other words, ZWNBSP is to ZWSP what NBSP is to SP, but
       useful mostly for writing systems where not all word boundaries
       are overtly indicated with visible space.

       Now, perhaps I'm making an assumption here, that

       X<SP>Y
       Y<SP>Z
       X<NBSP>Y
       Y<NBSP>Z

       would all count as two words for selection and arrow key
       movement, though the line breaking behaviour is, of course,
       different. This assumption seems reasonable to me: Given the
       string

       Mr.<NBSP>Smith

       I wouldn't want a double click on "M" to select the entire
       string, I wouldn't want a keying of CTRL-RT_ARROW to jump over
       the entire string, and I wouldn't want my spell checker to
       treat it as one word.

       If this assumption is valid here, then I would expect ZWSP and
       ZWNBSP to relate to one another in exactly the same way that SP
       and NBSP do, and I'd expect comparable behaviours from the
       former pair as I would for the latter pair, except that the
       former don't have width.
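
       The behaviours I am describing can be sketched as a small
       table with two independent properties per character: whether
       it delimits words (for selection, cursor movement and spell
       checking) and whether it allows a line break. This is only an
       illustration of the proposal, not a description of any
       existing implementation; PROPOSED, split_words and
       line_break_offsets are hypothetical names.

```python
SP, NBSP = "\u0020", "\u00a0"
ZWSP, ZWNBSP = "\u200b", "\ufeff"

# Assumption: the proposed behaviours, as
# (delimits words, allows line break, has width).
PROPOSED = {
    SP:     (True, True,  True),
    NBSP:   (True, False, True),
    ZWSP:   (True, True,  False),
    ZWNBSP: (True, False, False),
}


def split_words(text):
    """Split on every character that delimits words in the table above."""
    words, current = [], []
    for ch in text:
        if ch in PROPOSED and PROPOSED[ch][0]:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def line_break_offsets(text):
    """Return offsets just after each character that allows a line break."""
    return [i + 1 for i, ch in enumerate(text)
            if ch in PROPOSED and PROPOSED[ch][1]]


print(split_words("Mr." + NBSP + "Smith"))         # two words
print(line_break_offsets("Mr." + NBSP + "Smith"))  # no break allowed
print(line_break_offsets("Mr." + SP + "Smith"))    # break allowed after SP
```

       Note that under this table X<ZWNBSP>Y counts as two words, which
       differs from the counts in Mark's examples; that difference is
       exactly the point in question.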

       That's what I would have expected. Maybe there's a
       well-developed understanding in the industry of how these
       characters should behave, in which case I'd like to learn more
       about that understanding and the basis for it. If not, then I'd
       like to suggest this as a possible set of behaviours for these
       characters.

       Peter


