Re: WORD JOINER vs ZWNBSP

From: Marcel Schneider <charupdate_at_orange.fr>
Date: Wed, 1 Jul 2015 10:47:55 +0200 (CEST)

On Tue, Jun 30, 2015, Richard Wordingham wrote:

> On Tue, 30 Jun 2015 11:25:43 +0200 (CEST)
> Marcel Schneider wrote:
>
> > On Mon, Jun 30, 2015, Richard Wordingham wrote:
>
> > I tested on Microsoft Word 2010 Starter running on Windows 7 Starter,
> > on a netbook. This software being based on the full versions, the
> > interpretation of U+FEFF must be the standard behavior. I tested in
> > Latin script. You may wish to redo the tests, so please open a new
> > document, input two words, replace the blank with whatever character
> > the word boundaries behavior is to be checked of, and search for one
> > of the two words with the 'whole word' option enabled. If the result
> > is none, the test character indicates the absence of word boundaries;
> > if there is a result, the test character indicates the presence of
> > word boundaries.

Already yesterday (Tue, Jun 30, 2015), I wondered how my text had come to be altered, with line breaks needlessly removed and added.
Now I would like everybody to note that, at least on this Public List, I have *never* quoted anybody this way:
 
> At some time in June 2015, Richard Wordingham wrote:

This is why, to start this reply, I replaced that line with the accurate one, which can be checked at http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0279.html (except for the e-mail address, which the list engine suppresses when archiving, and which will be suppressed here again):

On Tue, Jun 30, 2015, Richard Wordingham wrote:
_______

> I did my own tests in Word 2010 with Windows 7. Although U+FEFF and
> U+2060 displayed differently when I enabled the display of
> 'non-printing' characters (spaces, inactive soft hyphens, non-breaking
> hyphens, paragraph ends etc.), they behaved the same when embedded in
> French l'eau and Thai กก - they changed each word to two words, as
> detected by ctrl/rt-arrow. However, this is wrong.

At the same time, Doug Ewell (to whom I'll reply soon, as well as to Khaled Hosny) was describing exactly what I see on screen: a .notdef box. Personally, I have enabled the following for everyday display: paragraph ends, manual line breaks, tab characters, text limits. (Unfortunately I cannot separately enable the display of style separators too. To see them, I must enable everything, as Richard did for his test.)

Ctrl + RIGHT skips over APOSTROPHEs and in-word single closing quotes, and therefore cannot be used to detect word boundaries.
Perhaps you might consider running the test as I did. It goes as follows:

1 Open a new document.
2 Input two words with a blank between them.
3 Replace the blank with the character whose word-boundary behavior is to be checked.
4 Do a search for one of the two words with the 'whole word' option enabled.
→ If the result is 'No instance found', the test character indicates the absence of word boundaries.
→ If the result is 'One instance found', the test character indicates the presence of word boundaries.

This way, Microsoft Word will tell you that the word 'eau' is found, because you used U+0027. The result is the same with U+2019. Only when you use U+02BC is U+006C U+02BC U+0065 U+0061 U+0075 considered a single word. With U+006C U+02BC U+FEFF U+0065 U+0061 U+0075, you will find the word 'eau' again. This is not wrong, given that a word joiner is expected to join words, so that neither NBSP nor any other no-break white space is needed to prevent a line break between them. However, the words remain words. This is why Ctrl + RIGHT stops at U+FEFF, detecting a word boundary.

The skipping of in-word punctuation by quick cursor movement is for word-processing convenience only, in English as well as in French and other languages. In your example, when 'l'eau' (the water) is to be replaced with its counterpart 'la terre' (the land), you place the cursor at the end, press Ctrl + BACKSPACE, and both words are deleted, so you can immediately type the non-elided article and the new word. But, as I said, that is not a test for word boundaries.
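For readers without Word at hand, the whole-word test above can be approximated with a regular expression. This is only a rough sketch: it uses the `\b` word boundary of Python's `re` module, which is neither Word's search algorithm nor the default word segmentation of the Unicode Standard, and the function names `whole_word_found` and `boundary_test` are made up for this illustration.

```python
import re
import unicodedata


def whole_word_found(text: str, word: str) -> bool:
    """True if `word` occurs in `text` between two regex word boundaries."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None


def boundary_test(separator: str, first: str = "l", second: str = "eau") -> bool:
    """Steps 1 to 4: join the two words with `separator`, then search for the second."""
    return whole_word_found(first + separator + second, second)


# Separators discussed in this thread (assumption: 'l' and 'eau' as test words).
SEPARATORS = ["\u0027", "\u2019", "\u02BC", "\u02BC\uFEFF", "\u2060"]

for sep in SEPARATORS:
    label = " ".join(f"U+{ord(ch):04X}" for ch in sep)
    categories = " ".join(unicodedata.category(ch) for ch in sep)
    verdict = "found" if boundary_test(sep) else "not found"
    print(f"{label:<14} ({categories}): 'eau' {verdict}")
```

Under Python's rules, 'eau' is found after U+0027, U+2019, U+02BC U+FEFF and U+2060, and not found after U+02BC alone, because U+02BC is a letter (general category Lm) while the others are not word characters. That happens to agree with the Word results reported above, but it says nothing about which behavior is correct.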

> >> No, this doesn't work.
>
> Clarification: It doesn't work in correct software. Correct software
> would have treated the modified words as single words.

As far as the French example is concerned, the elided article and the noun are *already* treated as two words in correct software. There are spell-checkers which fail to recognize a word when it is preceded by an elided article with an apostrophe, but these are *not* correct software. And they are *not* from Microsoft. I have no knowledge of Thai, but I guess that กก is a correct word, and therefore correct software will take notice of the U+FEFF or U+2060 you add between the two characters and assume that you mean *two* words but just don't want any blank between them. This is not wrong, again, and it is consistent with the fact that correct software complies with the Standards, that the Standards are designed to be useful, and that correct software is useful software.

Speaking of software, what other use is there in being correct?

Marcel
 

> Message of 30/06/15 23:40
> From: "Richard Wordingham"
> To: "Unicode Mailing List"
> Cc:
> Subject: Re: WORD JOINER vs ZWNBSP
>
> On Tue, 30 Jun 2015 11:25:43 +0200 (CEST)
> Marcel Schneider wrote:
>
> > At some time in June 2015, Richard Wordingham wrote:
>
> > I tested on Microsoft Word 2010 Starter running on Windows 7 Starter,
> > on a netbook. This software being based on the full versions, the
> > interpretation of U+FEFF must be the standard behavior. I tested in
> > Latin script. You may wish to redo the tests, so please open a new
> > document, input two words, replace the blank with whatever character
> > the word boundaries behavior is to be checked of, and search for one
> > of the two words with the 'whole word' option enabled. If the result
> > is none, the test character indicates the absence of word boundaries;
> > if there is a result, the test character indicates the presence of
> > word boundaries.
>
> I did my own tests in Word 2010 with Windows 7. Although U+FEFF and
> U+2060 displayed differently when I enabled the display of
> 'non-printing' characters (spaces, inactive soft hyphens, non-breaking
> hyphens, paragraph ends etc.), they behaved the same when embedded in
> French l'eau and Thai กก - they changed each word to two words, as
> detected by ctrl/rt-arrow. However, this is wrong.
>
>
> >> No, this doesn't work.
>
> Clarification: It doesn't work in correct software. Correct software
> would have treated the modified words as single words.
>
> Richard.
>
>
Received on Wed Jul 01 2015 - 03:49:19 CDT
