From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Aug 11 2003 - 21:46:18 EDT
There are a number of incorrect statements. My comments below.
----- Original Message -----
From: "Peter Kirk" <peter.r.kirk@ntlworld.com>
To: "Kenneth Whistler" <kenw@sybase.com>
Cc: <unicode@unicode.org>
Sent: Monday, August 11, 2003 16:28
Subject: Re: Questions on ZWNBS - for line initial holam plus alef
> I was aware that there should not be a line break or word break between
> the space and the NSM, although I suspect that many implementers will
> not be aware of this, or at least will not test for it properly and so
> treat any space as a word break and a line break opportunity.
It is hard to be clearer than what is written in the Line Break UAX (see
below).
> As I just
> wrote, this requirement to test all spaces for following NSMs is a
> significant inefficiency built into the standard.
This is incorrect. Characters (not just spaces) only need to be
checked for following NSMs in *those processes where that makes a
difference*. And in most of those processes, like line-break, some
lookahead is required anyway. To see, for example, whether there is a
linebreak after a character X, in almost all cases I have to look at
the character after X, and in many cases I have to look at more than
one character. Notice, for example, that in the sequence "a<space>" I
have to look ahead to see if there is a ":", so that French
punctuation works correctly.
In practice, looking at a character past a space does not represent a
significant performance issue. One typically uses a mechanism (like an
augmented state machine) that maintains enough state that the extra
lookahead costs almost nothing.
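
As a rough illustration of that point, here is a minimal sketch of a single pass with one character of lookahead. The function names are hypothetical, and "NSM" is approximated by the general categories Mn and Me, which is an assumption rather than what any particular implementation does.

```python
import unicodedata

def is_nonspacing_mark(ch):
    # Assumption: treat general categories Mn and Me as "NSM".
    return unicodedata.category(ch) in ("Mn", "Me")

def plain_spaces(text):
    """Return indices of spaces that are NOT the base of a following mark."""
    result = []
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        # One character of lookahead, the same kind of peek a line
        # breaker already needs for most of its rules.
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and is_nonspacing_mark(nxt):
            continue
        result.append(i)
    return result

# U+05B9 is HEBREW POINT HOLAM; the second space carries it.
print(plain_spaces("a b \u05B9c"))  # -> [1]
```

The peek is a constant amount of work per space, so the pass stays linear in the length of the text.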
>
> But there is still a problem if there is considered by default to be a
> word break and a line break opportunity AFTER the NSM. I would suggest,
> as a candidate for a concrete proposal, that the default behaviour be
> adjusted so that there is no word break or line break opportunity here
> either.
It would help if "concrete proposals" were actually, well, concrete.
I see no problem with Line Break
(http://www.unicode.org/reports/tr14/#Algorithm):
Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:
LB 7a In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.
Treat SP CM* as if it were ID
If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
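
A minimal sketch of how the quoted rule could be applied when assigning classes. The classifier below is a toy stand-in (it is not the real LineBreak.txt data), the function names are hypothetical, and it assumes that marks following any other base simply take on the class of that base.

```python
import unicodedata

def toy_class(ch):
    # Toy classifier for illustration only, NOT the LineBreak.txt data.
    if ch == " ":
        return "SP"
    if ch == "\u00A0":          # NO-BREAK SPACE
        return "GL"
    if unicodedata.category(ch) in ("Mn", "Me", "Mc"):
        return "CM"
    return "AL"

def resolve_marks(text):
    """Return one (cluster, class) pair per base-plus-marks run."""
    units = []
    for ch in text:
        cls = toy_class(ch)
        if cls == "CM" and units:
            base, base_cls = units[-1]
            # LB 7a as quoted: a space that carries a mark is treated as ID.
            if base_cls == "SP":
                base_cls = "ID"
            # Assumption: any other base keeps its own class.
            units[-1] = (base + ch, base_cls)
        else:
            units.append((ch, cls))
    return units

print(resolve_marks(" \u05B9"))        # SP + NSM resolves to class ID
print(resolve_marks("\u00A0\u05B9"))   # NBSP + NSM keeps class GL
```

With the classes resolved this way, the remaining rules never see a bare SP in front of the mark, which is why SP + NSM breaks like an ideograph while NBSP + NSM stays non-breaking.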
I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:
Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)
None of the other rules are relevant.
So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".
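
To make that concrete, here is a toy sketch using only the two quoted rules plus a single "do not break between letters" rule. The grapheme clusters are simplified to a base followed by Mn/Me marks, and all function names are hypothetical; this is not the full TR29 algorithm.

```python
import unicodedata

def clusters(text):
    """Simplified grapheme clusters: a base plus any following Mn/Me marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def toy_words(text):
    words = []
    prev_letter = False
    for cl in clusters(text):
        # Rule (3): judge the whole cluster by its first character.
        is_letter = cl[0].isalpha()
        if words and prev_letter and is_letter:
            words[-1] += cl      # no break between two letters
        else:
            words.append(cl)     # rule (14): otherwise break everywhere
        prev_letter = is_letter
    return words

# Space + holam comes out as its own segment, just as "^" does.
print(toy_words("ab \u05B9cd"))
print(toy_words("ab^cd"))        # -> ['ab', '^', 'cd']
```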
The one area where I do see a possible issue is one that you didn't
mention,
http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
should not behave as Sp in the rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.
If we wanted to change it, the *concrete* change would be to replace
(4) by:
Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)
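
A rough sketch of what that replacement would mean for classifying a cluster. The class names follow the sentence-boundary rules only loosely, the property lookup is a toy (only U+0020 counts as Sp here), and the function name is hypothetical.

```python
def sentence_class(cluster):
    """Toy sentence-break class for one grapheme cluster (base + marks)."""
    first = cluster[0]
    if first == " ":
        # Proposed (4a): a space that carries marks no longer acts as Sp.
        # A bare space still falls under (4b) and stays Sp.
        return "Any" if len(cluster) > 1 else "Sp"
    # (4b): everything else takes the class of its first character.
    if first in ".":
        return "ATerm"
    if first.isupper():
        return "Upper"
    if first.islower():
        return "Lower"
    return "Any"

print(sentence_class(" "))        # -> Sp
print(sentence_class(" \u05B9"))  # -> Any
```

Under this sketch, rules that look for Sp after a terminator would skip a space carrying a mark, which is the minor oddity the change is meant to remove.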
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
>
>
>