From: Peter Kirk (email@example.com)
Date: Mon Aug 11 2003 - 19:28:19 EDT
On 11/08/2003 12:26, Kenneth Whistler wrote:
>Peter Kirk wrote:
>>I think this may be a "Peter mistake". I meant to refer to spacing
>>It is certainly highly inappropriate for spacing diacritics to
>>be considered word boundaries.
>Why? It is entirely dependent on the orthography and conventions
Well, agreed, there may be orthographic conventions in which a spacing
diacritic is considered a word boundary or a break opportunity, e.g. if
used like a hyphen. But there are other mechanisms for forcing a word
boundary where otherwise there would not be one. Are there any for
suppressing a word boundary? Perhaps I need to encode <WJ, space,
diacritic, WJ> to avoid the word boundary implication? Would this work?
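
To make the proposed sequence concrete, here is a minimal Python sketch that
spells out <WJ, space, diacritic, WJ> code point by code point. The choice of
COMBINING ACUTE ACCENT as the diacritic and the helper name `describe` are
assumptions made for illustration only. The sketch does not settle whether the
sequence actually suppresses the word boundary; it only shows what would be
encoded.

```python
import unicodedata

WJ = "\u2060"          # WORD JOINER
SPACE = "\u0020"       # SPACE, used as the base for the spacing diacritic
DIACRITIC = "\u0301"   # COMBINING ACUTE ACCENT (assumed example diacritic)

# The sequence asked about above: <WJ, space, diacritic, WJ>
candidate = WJ + SPACE + DIACRITIC + WJ

def describe(s):
    """Return one 'U+XXXX NAME' string per code point in s."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s]

for line in describe(candidate):
    print(line)
```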
>... There is probably as much (or more) bad ASCII usage
>of spacing diacritics like `this', where a grave accent character
>is being misapplied to make a directional quotation mark, as
>there is actual, linguistically appropriate use of spacing
But this is an abuse of the spacing diacritic as punctuation. Proper,
linguistically appropriate use of spacing diacritics should not be
broken in order to support abuse. Or, if the standard wants to support
such abuse, we can reserve <space, diacritic> for the abuse and define
a new character XXX such that <XXX, diacritic> has the properties for
the linguistically appropriate use.
>Also, everyone should consider carefully the status of UAX #29,
>This is informative material. There are many different ways to
>divide text elements corresponding to grapheme clusters, words
>and sentences, and the Unicode Standard and this document do not
>restrict the ways in which implementations can do this.
>This specification is a *default* mechanism;
>more sophisticated engines can and should tailor it for particular
>locales or environments. ...
>The whole UAX is informative. ...
Then let it be correctly informative and not full of misinformation. And
let its default mechanism and recommendations be appropriate for the
majority of uses, including such cases as lists of diacritics which may
occur in any orthography.
Ken, it seems to me all the more clearly from looking at the latest
batch of postings on this list that the <space, diacritic> mechanism
defined by Unicode is fundamentally flawed. It works, but it creates a
serious and needless complication for all kinds of other processes,
including rendering and higher level processes. These processes cannot
simply take a space as a space and process it as such. Every time they
come across a space (which is very often!) they have to test whether it
is followed by a combining character, and if it is, they have to treat
that space specially. This has created a serious problem for
implementers, which is why they have produced non-conforming
implementations - and we are not talking about small companies which
have rushed into the market recently, we are talking about Microsoft,
among others, which has been sponsoring Unicode from the start, I
understand. Surely the UTC should not create difficulties for
implementers and then just shout at them for getting things wrong. The
UTC should try to produce a standard which is workable without
I agree that it works better to use NBSP here. There are fewer such
problems, but they have not gone away entirely. And NBSP is more likely
to be treated by implementers (in the absence of other guidelines from
Unicode) as fixed width, not trimmed to the width needed for the diacritic.
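
The per-space test described above (look at every space and check whether a
combining character follows it) can be sketched in Python as below. This is
only an illustration under stated assumptions: the function names
`is_combining` and `classify_spaces` are hypothetical, "combining character"
is approximated as any character whose general category starts with "M", and
both SPACE and NBSP are treated as possible bases. It is not taken from any
actual implementation.

```python
import unicodedata

# Characters that may serve as the base of a spacing diacritic (assumption:
# SPACE and NO-BREAK SPACE only).
BASE_SPACES = {"\u0020", "\u00A0"}

def is_combining(ch):
    """True if ch has general category Mn, Mc or Me (a combining mark)."""
    return unicodedata.category(ch).startswith("M")

def classify_spaces(text):
    """Return (index, kind) for each SPACE/NBSP in text.

    kind is 'diacritic-base' when the next character is a combining mark,
    otherwise 'plain'.
    """
    result = []
    for i, ch in enumerate(text):
        if ch in BASE_SPACES:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            kind = "diacritic-base" if nxt and is_combining(nxt) else "plain"
            result.append((i, kind))
    return result

# An ordinary space, then a space carrying COMBINING ACUTE ACCENT.
print(classify_spaces("a b \u0301c"))
```

The point of the sketch is the cost: the lookahead runs for every space in the
text, although only the rare 'diacritic-base' case needs special handling.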
-- Peter Kirk firstname.lastname@example.org (personal) email@example.com (work) http://www.qaya.org/