Re: Questions on ZWNBS - for line initial holam plus alef

From: Peter Kirk
Date: Tue Aug 12 2003 - 07:01:36 EDT

    On 11/08/2003 18:46, Mark Davis wrote:

    >There are a number of incorrect statements. My comments below.
    Thanks for the clarifications. Sorry about the inaccuracies. On some
    points Philippe may have misled me; on others it is just my inadequate
    understanding.

    >In practice, looking at a character past a space does not represent a
    >significant performance issue. One is typically using a mechanism
    >(like an augmented state machine) that maintains enough state that
    >that is not an issue.
    Understood. I hope Microsoft is listening.

    > ...
    >It helps if "concrete proposals" were actually, well, concrete.
    Of course! But I need help to get rid of any inaccuracies before the
    concrete sets.

    >I see no problem with Line Break.
    >Space + NSM is treated as a unit, with behavior that is pretty
    >consistent with a stand-alone accent like "^". To quote:
    >LB 7a In all of the following rules, if a space is the base character
    >for a combining mark, the space is changed to type ID. In other words,
    >break before SP CM* in the same cases as one would break before an ID.
    > Treat SP CM* as if it were ID
    >If you want non-breaking behavior, you use NBSP + NSM; if you want
    >breaking behavior, you use SP + NSM. The algorithm does that.
    Thank you. I have looked at this. Well, the ideal for me would be a
    mechanism whereby base + NSM was AL, rather than ID or GL. The problem
    comes, if I understand correctly, with a sequence like SP XX CM* AL,
    where I want a break opportunity after SP but not before AL. If I use
    NBSP for XX, I get no break opportunity at all. If I use SP, I may
    get a break before AL. But I suppose SP SP CM* WJ AL would do what I
    want, and perhaps also SP ZWSP NBSP CM* AL, as the break opportunity
    after ZWSP takes precedence over the no-break before NBSP.
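
To make the four sequences above easier to compare, here is a minimal Python sketch. It is not a transcription of UAX #14: it covers only the classes named in this thread (SP, ZW, GL, WJ, ID, AL, CM), and the rule order is an assumption chosen to match the precedence described above. The function names `resolve_marks`, `may_break` and `show_breaks` are hypothetical.

```python
def resolve_marks(classes):
    """Fold each combining mark (CM) into the unit started by its base."""
    units = []
    for cls in classes:
        if cls != "CM":
            units.append(cls)
        elif not units:
            units.append("AL")  # assumption: a mark with no base acts like AL
        elif units[-1] == "SP":
            units[-1] = "ID"    # quoted rule LB 7a: treat SP CM* as ID
        # otherwise the mark simply takes on the class of its base
    return units


def may_break(a, b):
    """Reduced pair rules, checked in an assumed order of precedence."""
    if b in ("SP", "ZW"):
        return False            # no break before a space or ZWSP
    if a == "ZW":
        return True             # break after ZWSP, even before GL
    if "WJ" in (a, b) or "GL" in (a, b):
        return False            # WJ and NBSP glue both neighbours
    if a == "SP" or "ID" in (a, b):
        return True             # break after spaces and around ID
    return False                # e.g. AL AL stays together


def show_breaks(classes):
    """Render the units with ÷ (break allowed) or × (no break) between them."""
    units = resolve_marks(classes)
    if not units:
        return ""
    out = [units[0]]
    for a, b in zip(units, units[1:]):
        out.append("÷" if may_break(a, b) else "×")
        out.append(b)
    return " ".join(out)


for seq in (["SP", "GL", "CM", "AL"],
            ["SP", "SP", "CM", "AL"],
            ["SP", "SP", "CM", "WJ", "AL"],
            ["SP", "ZW", "GL", "CM", "AL"]):
    print(show_breaks(seq))
```

Under these assumptions the NBSP form yields no break at all, the plain SP form allows a break before AL, and the WJ and ZWSP forms both give a break before the mark carrier but none before AL.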

    >I also see no problem with word-break. Look at the
    >specific text. To quote:
    >Treat a grapheme cluster as if it were a single character: the first
    >character of the cluster.
    > GC → FC (3)
    >Otherwise, break everywhere (including around ideographs).
    > Any ÷ Any (14)
    >None of the other rules are relevant.
    >So what this does is that SPACE + NSM will break before the space and
    >after the NSM (assuming there is only one). So it will behave like a
    >symbol, such as "*", or ")", or "^".
    OK, no real problem then. In some circumstances it might be more
    appropriate for space + NSM to behave like a letter rather than a
    symbol, but I recognise that tailoring may be required for fine
    details.
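
The behaviour quoted above can be illustrated with a small Python sketch. It is a heavy simplification and not the TR29 algorithm: clusters are formed by attaching nonspacing and enclosing marks to the preceding character, adjacent clusters that both start with a letter are kept together, and every other boundary is a break (the "Any ÷ Any" rule). The helper names `clusters` and `words` are hypothetical.

```python
import unicodedata


def clusters(text):
    """Group each character with the combining marks (Mn, Me) that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def words(text):
    """Treat each cluster as its first character; join only letter + letter."""
    segments = []
    for cluster in clusters(text):
        # A segment that starts with a letter contains only letter clusters,
        # so checking its first character is enough.
        if segments and cluster[0].isalpha() and segments[-1][0].isalpha():
            segments[-1] += cluster
        else:
            segments.append(cluster)
    return segments


HOLAM, ALEF = "\u05b9", "\u05d0"
print(words("ab " + HOLAM + ALEF))  # space + holam forms its own segment
print(words("ab^" + ALEF))          # "^" is segmented the same way
```

In this sketch the space + holam cluster is split off on both sides, exactly as a stand-alone "^" would be.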

    >The one area I do see that there may be an issue is with one that you
    >didn't mention,
    > Sp + NSM
    >should not behave as Sp in the rules (8), (10), and (11). Even there,
    >it will produce at most a minor oddity.
    >If we wanted to change it, the *concrete* change would be to replace
    >(4) by:
    >Treat a grapheme cluster as if it were a single character: the first
    >character of the cluster, except if that first character is a space.
    >In that case, change to Any.
    > SGC → FC (4a)
    > GC → FC (4b)
    Do you mean: "SGC → Any (4a)"?
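
For reference, the reading "SGC → Any" could be sketched in Python as follows. This assumes that SGC means a space followed by at least one combining mark, and it uses only three illustrative class names (Sp, ALetter, Any); neither the function name `cluster_class` nor this class set comes from the draft text.

```python
def cluster_class(cluster, proposed=False):
    """Return the class a grapheme cluster is treated as for word breaking."""
    first = cluster[0]
    if first == " ":
        has_marks = len(cluster) > 1
        if proposed and has_marks:
            return "Any"      # assumed reading of (4a): SGC -> Any
        return "Sp"           # current rule: cluster acts as its first character
    if first.isalpha():
        return "ALetter"
    return "Any"


space_holam = " \u05b9"
print(cluster_class(space_holam))                 # Sp under the current rule
print(cluster_class(space_holam, proposed=True))  # Any under the assumed (4a)
print(cluster_class(" ", proposed=True))          # a bare space stays Sp
```

With this reading a bare space keeps its Sp behaviour in rules (8), (10) and (11), and only a space that carries a mark is taken out of them.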

    How should I go about making a concrete proposal for this?

    Anyway, many thanks for your help. I think I am beginning to realise
    that this is a small problem which has been blown out of proportion by
    others. I still see the space + NSM choice as a rather poor initial
    design, but one which can be lived with.

    Peter Kirk
