Re: Questions on ZWNBS - for line initial holam plus alef

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Aug 11 2003 - 21:46:18 EDT


    There are a number of incorrect statements. My comments below.

    ----- Original Message -----
    From: "Peter Kirk" <peter.r.kirk@ntlworld.com>
    To: "Kenneth Whistler" <kenw@sybase.com>
    Cc: <unicode@unicode.org>
    Sent: Monday, August 11, 2003 16:28
    Subject: Re: Questions on ZWNBS - for line initial holam plus alef

    > I was aware that there should not be a line break or word break between
    > the space and the NSM, although I suspect that many implementers will
    > not be aware of this, or at least will not test for it properly and so
    > treat any space as a word break and a line break opportunity.

    It is hard to be clearer than what is written in the LineBreak UAX (see
    below).

    > As I just
    > wrote, this requirement to test all spaces for following NSMs is a
    > significant inefficiency built into the standard.

    This is incorrect. Characters (not just spaces) only need to be
    checked for following NSMs in *those processes where that makes a
    difference*. And in most of those processes, like line-break, some
    lookahead is required anyway. To see, for example, whether there is a
    linebreak after a character X, in almost all cases I have to look at
    the character after X, and in many cases I have to look at more than
    one character. Notice, for example, that in the sequence "a<space>" I
    have to look ahead to see if there is a ":", so that French
    punctuation works correctly.

    In practice, looking at a character past a space does not represent a
    significant performance issue. One is typically using a mechanism
    (like an augmented state machine) that maintains enough state that
    that is not an issue.
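
    To make the lookahead point concrete, here is a minimal Python sketch.
    It is not the UAX #14 algorithm: the function name can_break_after, the
    punctuation set, and the rule that breaks are offered only after spaces
    are all assumptions made for illustration. It shows that deciding
    whether to break after a space costs one extra character of lookahead,
    whether that character is French-style punctuation or an NSM.

```python
import unicodedata

def can_break_after(text, i):
    """Simplified break-opportunity check after text[i] (illustrative only)."""
    if i + 1 >= len(text):
        return True  # end of text is always a break
    if text[i] != " ":
        return False  # assumption: this sketch only offers breaks after spaces
    nxt = text[i + 1]
    if nxt in ":;!?":
        return False  # keep French-style punctuation with the preceding word
    if unicodedata.category(nxt) in ("Mn", "Mc", "Me"):
        return False  # the space is the base of a combining mark
    return True
```

    For example, can_break_after("a b", 1) allows the break, while the same
    call on "a :" or on "a" + space + U+05B9 (HEBREW POINT HOLAM) does not.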

    >
    > But there is still a problem if there is considered by default to be a
    > word break and a line break opportunity AFTER the NSM. I would suggest,
    > as a candidate for a concrete proposal, that the default behaviour be
    > adjusted so that there is no word break or line break opportunity here
    > either.

    It would help if "concrete proposals" were actually, well, concrete.

    I see no problem with Line Break
    (http://www.unicode.org/reports/tr14/#Algorithm):

    Space + NSM is treated as a unit, with behavior that is pretty
    consistent with a stand-alone accent like "^". To quote:

    LB 7a In all of the following rules, if a space is the base character
    for a combining mark, the space is changed to type ID. In other words,
    break before SP CM* in the same cases as one would break before an ID.

                            Treat SP CM* as if it were ID

    If you want non-breaking behavior, you use NBSP + NSM; if you want
    breaking behavior, you use SP + NSM. The algorithm does that.
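
    The quoted rule can be sketched in Python as follows. This is only an
    illustration of "Treat SP CM* as if it were ID", not the full line
    break algorithm: the names simple_class and resolve_classes are
    hypothetical, and the four-way classification (SP, GL for NBSP, CM, and
    AL for everything else) is a rough stand-in for the real Line_Break
    property data.

```python
import unicodedata

def simple_class(ch):
    """Rough stand-in for the Line_Break property (assumption, not UAX #14 data)."""
    if ch == "\u0020":
        return "SP"
    if ch == "\u00A0":
        return "GL"  # NBSP: non-breaking "glue"
    if unicodedata.category(ch) in ("Mn", "Mc", "Me"):
        return "CM"
    return "AL"

def resolve_classes(text):
    """Return (cluster, class) pairs, treating SP CM* as if it were ID."""
    units = []
    i = 0
    while i < len(text):
        cls = simple_class(text[i])
        j = i + 1
        # Absorb the combining marks that follow the base character.
        while j < len(text) and simple_class(text[j]) == "CM":
            j += 1
        if cls == "SP" and j > i + 1:
            cls = "ID"  # a space acting as base for CM* behaves like ID
        units.append((text[i:j], cls))
        i = j
    return units
```

    With this sketch, a space followed by U+05B9 comes out as one unit of
    class ID, while NBSP followed by the same mark stays GL, which mirrors
    the SP + NSM versus NBSP + NSM distinction described above.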

    I also see no problem with word-break
    (http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
    specific text. To quote:

    Treat a grapheme cluster as if it were a single character: the first
    character of the cluster.
                GC → FC (3)
    ...
    Otherwise, break everywhere (including around ideographs).
                Any ÷ Any (14)

    None of the other rules are relevant.

    The effect is that SPACE + NSM will break before the space and after
    the NSM (assuming there is only one), so it will behave like a symbol
    such as "*", ")", or "^".
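
    A minimal Python sketch of the two quoted rules, under stated
    assumptions: the helper names clusters and word_segments are
    hypothetical, a cluster is approximated as a base character plus the
    combining marks after it, and the rules that keep letters and digits
    together are reduced to a single isalnum() test. It is meant only to
    show why SPACE + NSM ends up as its own segment.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def word_segments(text):
    """Split text into segments using the simplified rules described above."""
    segs = []
    for cl in clusters(text):
        first = cl[0]  # rule (3): the cluster behaves like its first character
        if segs and first.isalnum() and segs[-1][0].isalnum():
            segs[-1] += cl  # simplified stand-in for the letter/digit rules
        else:
            segs.append(cl)  # rule (14): otherwise break everywhere
    return segs
```

    Here word_segments("ab" + " \u05B9" + "cd") yields three segments, with
    the space and its mark isolated in the middle one, just as "a*b" splits
    around the "*".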

    The one area where I do see a possible issue is one that you didn't
    mention: http://www.unicode.org/reports/tr29/#Sentence_Boundaries.
    Sp + NSM should not behave as Sp in rules (8), (10), and (11). Even
    there, it will produce at most a minor oddity.

    If we wanted to change it, the *concrete* change would be to replace
    (4) by:

    Treat a grapheme cluster as if it were a single character: the first
    character of the cluster, except if that first character is a space.
    In that case, change to Any.
                SGC → FC (4a)
                GC → FC (4b)
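
    The proposed replacement could be sketched in Python as below. The
    class names returned by char_class are a small illustrative subset, the
    function names are hypothetical, and the sketch reads the exception as
    applying only when the space actually carries at least one combining
    mark, which is an interpretation rather than something the rule text
    spells out.

```python
import unicodedata

def is_mark(ch):
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def char_class(ch):
    """Tiny illustrative subset of sentence-break classes (assumption)."""
    if ch == " ":
        return "Sp"
    if ch == ".":
        return "ATerm"
    if ch.isupper():
        return "Upper"
    if ch.islower():
        return "Lower"
    return "Any"

def sentence_classes(text):
    """Return (cluster, class) pairs under the proposed rule."""
    result = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        cls = char_class(text[i])  # cluster takes the class of its first character
        if cls == "Sp" and j - i > 1:
            cls = "Any"  # exception: a space carrying marks becomes Any
        result.append((text[i:j], cls))
        i = j
    return result
```

    A bare space keeps class Sp, so the existing rules still see it, while
    a space followed by an NSM is reported as Any and no longer takes part
    in the rules that look for Sp.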

    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >


