From: Peter Kirk (peter.r.kirk@ntlworld.com)
Date: Tue Aug 12 2003 - 07:01:36 EDT
On 11/08/2003 18:46, Mark Davis wrote:
>There are a number of incorrect statements. My comments below.
>
>
Thanks for the clarifications. Sorry about the inaccuracies. On some
points Philippe may have misled me; on others it is just my inadequate
understanding.
>...
>
>In practice, looking at a character past a space does not represent a
>significant performance issue. One is typically using a mechanism
>(like an augmented state machine) that maintains enough state that
>that is not an issue.
>
>
Understood. I hope Microsoft is listening.
> ...
>
>It helps if "concrete proposals" were actually, well, concrete.
>
>
Of course! But I need help to get rid of any inaccuracies before the
concrete sets.
>I see no problem with Line Break.
>(http://www.unicode.org/reports/tr14/#Algorithm):
>
>Space + NSM is treated as a unit, with behavior that is pretty
>consistent with a stand-alone accent like "^". To quote:
>
>LB 7a In all of the following rules, if a space is the base character
>for a combining mark, the space is changed to type ID. In other words,
>break before SP CM* in the same cases as one would break before an ID.
>
> Treat SP CM* as if it were ID
>
>If you want non-breaking behavior, you use NBSP + NSM; if you want
>breaking behavior, you use SP + NSM. The algorithm does that.
>
>
Thank you. I have looked at this. Well, the ideal for me would be a
mechanism whereby base + NSM was AL, rather than ID or GL. The problem
comes, if I understand correctly, with a sequence like SP XX CM* AL,
where I want a break opportunity after SP but not before AL. If I use
NBSP for XX, I get no break opportunity at all. If I use SP, I may
get a break before AL. But I suppose SP SP CM* WJ AL would do what I
want, perhaps also SP ZWSP NBSP CM* AL as the break opportunity after
ZWSP takes precedence over the no break before NBSP.
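
To make the quoted rule LB 7a concrete, here is a minimal sketch in Python of how SP CM* might be folded into a single unit of class ID before the pair rules run. The names `base_class` and `resolve_units` are hypothetical, the class table is drastically reduced (anything unlisted is treated as AL), and class CM is approximated by the General_Category values Mn, Mc and Me. It is an illustration under those assumptions, not the TR14 algorithm itself.

```python
import unicodedata


def is_combining_mark(ch):
    # Assumption: approximate line-break class CM by General_Category M*.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def base_class(ch):
    # Hypothetical, heavily reduced class table; everything else is AL.
    table = {" ": "SP", "\u00A0": "GL", "\u200B": "ZW", "\u2060": "WJ"}
    return table.get(ch, "AL")


def resolve_units(text):
    """Group each base with its combining marks and assign one class.

    A space that carries at least one mark is changed to ID (LB 7a as
    quoted above); any other base keeps its own class. A mark with no
    preceding base falls through to the default class in this sketch.
    """
    units = []
    for ch in text:
        if is_combining_mark(ch) and units:
            prev_text, prev_cls = units[-1]
            cls = "ID" if prev_cls == "SP" else prev_cls
            units[-1] = (prev_text + ch, cls)
        else:
            units.append((ch, base_class(ch)))
    return units
```

With this sketch, `resolve_units(" \u0301a")` groups the space and the acute accent into one unit of class ID, while `resolve_units("\u00A0\u0301")` leaves NBSP plus accent as GL, which matches the breaking versus non-breaking distinction described in the quote.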
>I also see no problem with word-break
>(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
>specific text. To quote:
>
>Treat a grapheme cluster as if it were a single character: the first
>character of the cluster.
> GC → FC (3)
>...
>Otherwise, break everywhere (including around ideographs).
> Any ÷ Any (14)
>
>None of the other rules are relevant.
>
>So what this does is that SPACE + NSM will break before the space and
>after the NSM (assuming there is only one). So it will behave like a
>symbol, such as "*", or ")", or "^".
>
>
OK, no real problem then. In some circumstances it might be more
appropriate for space + NSM to behave like a letter rather than a
symbol, but I recognise that tailoring may be required for fine details.
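
The behaviour described in the quote can be sketched in Python as follows. This is a simplified illustration, not the TR29 rule set: grapheme clusters are approximated as a base followed by combining marks (General_Category Mn, Mc, Me), the only "no break" rule kept is letter followed by letter, and everything else falls through to "break everywhere" as in rule (14). The function names are hypothetical.

```python
import unicodedata


def clusters(text):
    # Simplified grapheme clusters: a base plus any following marks.
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def word_segments(text):
    """Split text at word boundaries under the reduced rules.

    Each cluster is represented by its first character (rule 3 as
    quoted). Adjacent letters stay together; every other pair of
    clusters is separated (rule 14 as quoted).
    """
    segments = []
    prev_first = None
    for cluster in clusters(text):
        first = cluster[0]
        if prev_first is not None and prev_first.isalpha() and first.isalpha():
            segments[-1] += cluster
        else:
            segments.append(cluster)
        prev_first = first
    return segments
```

Under these assumptions `word_segments("ab \u0302cd")` yields three segments, with the space plus circumflex standing alone between the two words, just as `word_segments("ab^cd")` isolates the "^".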
>The one area I do see that there may be an issue is with one that you
>didn't mention,
>http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
>should not behave as Sp in the rules (8), (10), and (11). Even there,
>it will produce at most a minor oddity.
>
>If we wanted to change it, the *concrete* change would be to replace
>(4) by:
>
>Treat a grapheme cluster as if it were a single character: the first
>character of the cluster, except if that first character is a space.
>In that case, change to Any.
> SGC → FC (4a)
> GC → FC (4b)
>
>
Do you mean: "SGC → Any (4a)"?
How should I go about making a concrete proposal for this?
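
For illustration only, the replacement for rule (4) could be sketched in Python as below, assuming the reading queried above (SGC → Any) and assuming that SGC means a grapheme cluster consisting of a space followed by at least one combining mark, so that a bare space still counts as Sp. The names `ANY`, `represent` and `sentence_rule_input` are hypothetical, and clusters are again approximated as a base plus marks of General_Category Mn, Mc or Me.

```python
import unicodedata

ANY = "Any"  # hypothetical placeholder for the class "Any"


def represent(cluster):
    # (4a) under the assumed reading: space + marks no longer acts as Sp.
    if cluster[0] == " " and len(cluster) > 1:
        return ANY
    # (4b): otherwise the cluster behaves like its first character.
    return cluster[0]


def sentence_rule_input(text):
    """Return one representative per simplified grapheme cluster."""
    reps = []
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            cluster += ch
            continue
        if cluster:
            reps.append(represent(cluster))
        cluster = ch
    if cluster:
        reps.append(represent(cluster))
    return reps
```

The later sentence-boundary rules would then see `Any` rather than a space for the sequence space + NSM, so rules (8), (10) and (11) would not treat it as Sp.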
Anyway, many thanks for your help. I think I am beginning to realise
that this is a small problem which has been blown out of proportion by
others. I still see the space + NSM choice as a rather poor initial
design, but one which can be lived with.
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/