Re: REALLY *not* Tamil - changing scripts (long)

From: Keld Jørn Simonsen (
Date: Mon Jul 29 2002 - 18:47:53 EDT

On Mon, Jul 29, 2002 at 03:21:03PM -0700, Kenneth Whistler wrote:
> > > It's *much* easier -- and, in the long term, safer -- for them to
> > > select from the extensive inventory of characters available in Unicode and
> > > to avoid using ASCII punctuation characters with redefined word-building
> > > semantics.
> >
> > I don't get what you are saying here, why should people be limited to
> > ASCII punctuation characters?
> That isn't what Peter was saying. You are confused here by your misinterpretation
> of what he was saying.
> The recommendation that Peter was making is that people devising orthographies
> for languages should stick to Unicode letters for the letters of their
> orthography. (If the script in question is Latin, as most new orthographies
> are, then there are *hundreds* of Latin letters to choose from in the standard.)

> What orthography developers should avoid is using characters like "7" "@" "!"
> "$", "'" and so on as letters of their orthography, since those are certain
> to cause all kinds of havoc with word-break and other processes for standard
> software -- or even lead to the kind of absurdities as people wanting illegal
> constructs like: 'jo', which locales can*not* fix.
OK, I now understand, and agree with your recommendation to avoid 7 and @ etc.
in newly designed orthographies. I am not sure about established
orthographies, though.
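
To make the word-break problem concrete, here is a minimal sketch in Python. It assumes a deliberately naive word splitter that treats only characters with a Unicode letter or mark general category as word-building; the function names and the sample spelling are illustrative only, and real software typically uses more elaborate rules. It compares a spelling that uses the ASCII exclamation mark with one that uses U+01C3 LATIN LETTER RETROFLEX CLICK, which looks similar but is classified as a letter.

```python
import unicodedata


def is_word_char(ch):
    """Assumption for this sketch: letters (L*) and marks (M*) build words."""
    return unicodedata.category(ch)[0] in ("L", "M")


def split_words(text):
    """Split text into runs of word-building characters (naive, hypothetical)."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            # A non-letter ends the current word and is itself dropped.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# The characters named in the quoted text are not letters by general category.
for ch in "7@!$'":
    print(repr(ch), unicodedata.category(ch))

ascii_spelling = "!kung"        # starts with U+0021 EXCLAMATION MARK
letter_spelling = "\u01c3kung"  # starts with U+01C3 LATIN LETTER RETROFLEX CLICK

print(split_words(ascii_spelling))   # the "!" is dropped from the word
print(split_words(letter_spelling))  # the click letter stays inside the word
```

With the ASCII spelling the first character is lost from the word; with the Unicode letter the word survives intact, without any locale-specific tailoring.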

> > With GNU libc you can declare your own set
> > of punctuation characters in the locale, and they can be any 10646
> > character.
> Peter was talking about the opposite case. But you should examine carefully
> what the implications are of your suggestion here. If I were to make the
> absurd choice of picking 18 Chinese characters to serve as my punctuation
> characters, and then went through the exercise of declaring my own
> locale with GNU libc, I would only be guaranteeing that my locale (and all
> my text data) would only function correctly in a microscopic environment
> that I defined (or could browbeat a few others to share).
> The reason for sticking to the Universal Character Set and for sticking
> to standardized properties for the characters in that set is to
> guarantee widespread interoperability and to guarantee that my text,
> in my language, works correctly in all off-the-shelf software -- not
> merely in my own hacked-up locale.

In Linux, for a specific locale, it is relatively easy to get the new locale
to work with all off-the-shelf software: you need to write the locale and
submit it to the glibc people, and then, in about 6 months or so, it
would be available in all new mainstream Linux distributions, off the
shelf. And all applications would adhere to it, given Linux's advanced
i18n technology.

> Serious orthography designers should not allow themselves to get
> stuck in such dead-end traps.

I am not fully sure which characters they were talking about, but
I was thinking about just a few special additions to the common set of
punctuation characters, probably determined by an established
orthography, and for that purpose I think it would be OK to add these
punctuation characters. Otherwise I agree that you should stick to an
established set of attributes, like what has been done in Linux
(which uses ISO TR 14652 technology for the character properties, as you

Kind regards

This archive was generated by hypermail 2.1.2 : Mon Jul 29 2002 - 16:48:43 EDT