Re: Questions on ZWNBS - for line initial holam plus alef

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Aug 11 2003 - 21:46:18 EDT

Next message: Lisa Moore: "Unicode 4.0 is online at last!"

Previous message: Kenneth Whistler: "Re: Questions on ZWNBS - for line initial holam plus alef"
In reply to: Peter Kirk: "Re: Questions on ZWNBS - for line initial holam plus alef"
Next in thread: Peter Kirk: "Re: Questions on ZWNBS - for line initial holam plus alef"
Reply: Peter Kirk: "Re: Questions on ZWNBS - for line initial holam plus alef"

There are a number of incorrect statements. My comments below.

----- Original Message -----
From: "Peter Kirk" <peter.r.kirk@ntlworld.com>
To: "Kenneth Whistler" <kenw@sybase.com>
Cc: <unicode@unicode.org>
Sent: Monday, August 11, 2003 16:28
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> I was aware that there should not be a line break or word break between
> the space and the NSM, although I suspect that many implementers will
> not be aware of this, or at least will not test for it properly and so
> treat any space as a word break and a line break opportunity.

It is hard to be clearer than what is written in the LineBreak UAX
(see below).

> As I just
> wrote, this requirement to test all spaces for following NSMs is a
> significant inefficiency built into the standard.

This is incorrect. Characters (not just spaces) only need to be
checked for following NSMs in *those processes where that makes a
difference*. And in most of those processes, like line-break, some
lookahead is required anyway. To see, for example, whether there is a
linebreak after a character X, in almost all cases I have to look at
the character after X, and in many cases I have to look at more than
one character. Notice, for example, that in the sequence "a<space>" I
have to look ahead to see if there is a ":", so that French
punctuation works correctly.

In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.
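
As a rough illustration of the lookahead point, here is a minimal Python sketch. It is a toy, not the UAX #14 algorithm: the function names and the FRENCH_PUNCT set are hypothetical and chosen only for this example. It shows that a single character of lookahead past a space settles both the French-punctuation case and the space-plus-NSM case.

```python
import unicodedata

# Hypothetical, simplified set: punctuation that French typography keeps
# attached to the preceding word even when a space comes in between.
FRENCH_PUNCT = set(":;?!")


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def break_after_space(text, i):
    """Toy check: may a line break fall right after the space at text[i]?

    One character of lookahead decides both cases discussed above.
    """
    if text[i] != " ":
        raise ValueError("text[i] must be a space")
    if i + 1 >= len(text):
        return True   # toy choice: allow a break at the end of the text
    nxt = text[i + 1]
    if nxt in FRENCH_PUNCT:
        return False  # "a :" stays together
    if is_mark(nxt):
        return False  # the space is the base character of the mark
    return True
```

For example, break_after_space("a b", 1) is True, while the same call on "a :" or on "a" + space + U+05B9 (HEBREW POINT HOLAM) + "b" is False. The cost is one extra character inspected, which a state-machine implementation already has on hand.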

>
> But there is still a problem if there is considered by default to be a
> word break and a line break opportunity AFTER the NSM. I would suggest,
> as a candidate for a concrete proposal, that the default behaviour be
> adjusted so that there is no word break or line break opportunity here
> either.

It would help if "concrete proposals" were actually, well, concrete.

I see no problem with Line Break
(http://www.unicode.org/reports/tr14/#Algorithm):

Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:

LB 7a In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.

Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
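
The rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: simple_class is a hypothetical stand-in for a real LineBreak property lookup and knows just four classes (SP, GL for NBSP, CM, and AL for everything else), and resolve_units is a name invented for this example.

```python
import unicodedata


def simple_class(ch):
    """Hypothetical, heavily simplified stand-in for the LineBreak property."""
    if ch == " ":
        return "SP"
    if ch == "\u00A0":
        return "GL"   # NO-BREAK SPACE
    if unicodedata.category(ch) in ("Mn", "Mc", "Me"):
        return "CM"
    return "AL"


def resolve_units(text):
    """Attach each run of CM to its base; a space base becomes ID (LB 7a)."""
    units = []
    for ch in text:
        cls = simple_class(ch)
        if cls == "CM" and units:
            base, base_cls = units[-1]
            if base_cls == "SP":
                base_cls = "ID"   # treat SP CM* as if it were ID
            units[-1] = (base + ch, base_cls)
        else:
            units.append((ch, cls))
    return units
```

With this sketch, "a" + SPACE + U+05B9 + "b" resolves to AL, ID, AL, so the space-plus-mark unit breaks like an ideograph, while "a" + NBSP + U+05B9 resolves to AL, GL and stays non-breaking.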

I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)

None of the other rules are relevant.

So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".
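
To make that concrete, here is a small Python sketch of just the two quoted rules plus one "do not break between letters" rule, which is needed so that ordinary words stay whole. It is a toy under stated assumptions, not the UAX #29 algorithm: clusters and words are hypothetical names, and a grapheme cluster is approximated as a base character followed by its combining marks.

```python
import unicodedata


def clusters(text):
    """Approximate grapheme clusters: a base plus any following marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def words(text):
    """Split text using rule (3), a letter-letter join, and rule (14)."""
    segs = []
    prev_letter = False
    for gc in clusters(text):
        # Rule (3): the cluster is classified by its first character.
        is_letter = gc[0].isalpha()
        if segs and prev_letter and is_letter:
            segs[-1] += gc      # no break between two letters
        else:
            segs.append(gc)     # rule (14): otherwise break everywhere
        prev_letter = is_letter
    return segs
```

Here words("ab" + " \u05B9" + "cd") yields three segments, "ab", the space with its mark, and "cd", which is the same shape as words("ab*cd").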

The one area where I do see a possible issue is one that you didn't
mention, http://www.unicode.org/reports/tr29/#Sentence_Boundaries.
Sp + NSM should not behave as Sp in rules (8), (10), and (11). Even
there, the current behavior will produce at most a minor oddity.

If we wanted to change it, the *concrete* change would be to replace
(4) by:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)
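
A possible reading of that replacement rule, sketched in Python. Several things here are assumptions rather than anything stated above: SGC is read as "a grapheme cluster whose first character is a space", (4a) is applied only when the space actually carries at least one following mark (so that a bare space keeps class Sp), and base_class is a hypothetical three-class stand-in for the real sentence-break property.

```python
def base_class(ch):
    """Hypothetical, simplified sentence-break class of one character."""
    if ch == " ":
        return "Sp"
    if ch == ".":
        return "ATerm"
    return "Any"


def sentence_class(gc):
    """Class of a grapheme cluster under the proposed (4a)/(4b).

    gc is a non-empty string: a base character plus its combining marks.
    """
    first = gc[0]
    if first == " " and len(gc) > 1:
        return "Any"            # (4a): space-based cluster changes to Any
    return base_class(first)    # (4b): otherwise use the first character
```

Under these assumptions a lone space is still Sp, so rules (8), (10), and (11) apply to it as before, while a space followed by U+05B9 is Any and no longer takes part in them.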

>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
>
>
>


This archive was generated by hypermail 2.1.5 : Mon Aug 11 2003 - 22:28:32 EDT