Re: Scalability of ScriptExtensions (was: RE: Borrowed Thai Punctuation in Tai Tham Text)

From: Richard Wordingham <>
Date: Tue, 9 Jul 2013 04:08:13 +0100

On Mon, 8 Jul 2013 20:35:05 +0000
"Whistler, Ken" <> wrote:

> But, this query raises a couple further questions for me regarding
> the scalability and maintenance of ScriptExtensions.txt. Basically,
> reports coming in of "Script X character Y is also used with Script Z"
> are proving to be a rather haphazard and ad hoc way of maintaining
> that data file and the related property.

Rather like the ad hoc way of adding characters to Unicode?

> I'm not sure
> what alternative there is now, but find it very distasteful that the
> UTC has been forced into the mode of property maintenance for such
> a subjective and haphazard collection of observations about
> common usage.

The haphazardness is already acknowledged in UAX #24 Revision 19 Section
2.9 Paragraph 2. The currently expressed criteria are subjective.

> The second question is this: what likelihood is there that a full
> implementation of Tai Tham will not also be expected to be
> capable of handling all of Thai?

I would say it is quite high. None of the non-Unicode Tai Tham fonts
I've encountered actually support Thai, except in that they may map
Thai characters to Tai Tham glyphs. If a Tai Tham font is to support
all of Tai Tham, then if it is to include all of Thai, why not all of
Lao and New Tai Lue? Indeed, why not all of Myanmar and even a
passable coverage of Chinese?

On the other hand, it is quite reasonable for a Tai Tham font to
include such non-native characters as nowadays occur within text in the
script. I would consider it prudent to include common European
punctuation in both narrow and full width forms, and reasonable to
include Thai, Lao, Burmese and Chinese currency symbols.

> I ask that because the situation echoes the rather more extensive
> situation of East Asian punctuation usage for ideographic or
> syllabic scripts typeset together with Chinese. Trying to track
> all of those instances down and getting them all enshrined in
> ScriptExtensions.txt strikes me as a losing proposition already --
> and the situation is likely to just get worse as more historic
> scripts from East Asia end up in Unicode eventually.

Hasn't the struggle already been given up for punctuation categorised in
the common script? U+0964 DEVANAGARI DANDA has script extension
property {Common} even though one wouldn't expect to find it in a
script that has its own danda.

> A much more productive approach, it seems to me, would be instead to
> try to establish information about various, identifiable typographical
> traditions for use of punctuation around the world, and then associate
> "exemplar sets" of punctuation used with those traditions. Such an
> approach, I assert, would tend to be much more robust (as well
> as more comprehensible) than definition of very fragile set
> definitions associating lists of scripts one-by-one with various
> characters.

The results feed into two databases - CLDR and script extensions. I
am currently more concerned with script extensions, because that may
affect the splitting of text into script runs. This is most
significant when there is no useful font setting in effect - the
punctuation in the default Thai and in the default Tai Tham fonts may
not match.
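
To make the run-splitting concern concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration: the function
names are invented, the block ranges are a crude stand-in for the Script
property, and the extension entry for U+0E2F is hypothetical rather than
copied from ScriptExtensions.txt. It is one simple way of splitting runs by
intersecting candidate script sets, not a description of any particular
implementation.

```python
def toy_script_set(ch, extensions):
    """Return the scripts a character may belong to, or None for 'any'.

    `extensions` is a hand-made stand-in for ScriptExtensions.txt.
    The block ranges below are a crude approximation of the Script property.
    """
    if ch in extensions:
        return set(extensions[ch])
    cp = ord(ch)
    if 0x0E00 <= cp <= 0x0E7F:   # Thai block
        return {"Thai"}
    if 0x1A20 <= cp <= 0x1AAF:   # Tai Tham block (ISO 15924 code "Lana")
        return {"Lana"}
    return None                  # treat everything else as Common


def split_script_runs(text, extensions):
    """Split text into runs whose characters share at least one script."""
    runs = []
    current = ""
    candidates = None            # None means "no constraint yet"
    for ch in text:
        scripts = toy_script_set(ch, extensions)
        if scripts is None:
            merged = candidates
        elif candidates is None:
            merged = scripts
        else:
            merged = candidates & scripts
        if current and merged is not None and not merged:
            # No script in common: close the run and start a new one.
            runs.append((current, candidates))
            current, candidates = ch, scripts
        else:
            current += ch
            candidates = merged
    if current:
        runs.append((current, candidates))
    return runs


# Two Tai Tham letters, THAI CHARACTER PAIYANNOI, one more Tai Tham letter.
sample = "\u1A20\u1A21\u0E2F\u1A22"

# Without an extension entry the Thai mark forces a run break on both sides.
runs_without = split_script_runs(sample, {})

# Hypothetical entry: pretend U+0E2F is recorded as used with Thai and Tai Tham.
runs_with = split_script_runs(sample, {"\u0E2F": {"Thai", "Lana"}})
```

With the hypothetical entry the sample stays in one Tai Tham run, so the
punctuation would be taken from the same font as the surrounding letters;
without it the mark becomes a separate Thai run and falls to whatever the
default Thai font provides.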

Perhaps the point you are making is that the script extensions database
should be based on probable practice or useful non-assumptions. (I say
non-assumption, because we should probably not expect interrobang to
provide evidence of a change of script, though I do wonder how many
scripts actually use it. Should it really occur in cuneiform
or hieroglyphic scripts?) For example, when I've found Thai
punctuation in Tai Tham script in a few more books, should the
observation we act on be that the Tai Tham script uses Thai punctuation
except where the characters are separately encoded?

The probable results that Thai punctuation* is used in Northern Thai in
Tai Tham script and that wide Western punctuation is used in Tai Lue in
Tai Tham script would have to be merged when handling Tai Tham text not
tagged for language.

* That is, excluding punctuation separately encoded for Tai Tham.

Received on Mon Jul 08 2013 - 22:13:03 CDT
