Re: [indic] Re: 28th IUC paper - Tamil Unicode New

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 22 2005 - 20:11:20 CDT

Next message: Richard Wordingham: "Re: [indic] Re: 28th IUC paper - Tamil Unicode New"

Previous message: Adam Twardoch: "Re: 28th IUC paper - Tamil Unicode New"
Next in thread: Richard Wordingham: "Re: [indic] Re: 28th IUC paper - Tamil Unicode New"
Reply: Richard Wordingham: "Re: [indic] Re: 28th IUC paper - Tamil Unicode New"
Reply: Antoine Leca: "Korean [Was: 28th IUC paper - Tamil Unicode New]"

Richard responded:

> > That translates to: "it can be displayed with a dumb rendering engine
> > and a simple font".
>
> Largely, yes. I suspect the default Unicode collation would also produce
> the correct results.

Only for the TUNE set, and only if the default table were set up
following its binary order -- which could of course be done.

> > In fact adding TUNE to Unicode "without any awareness of Tamil
> > as a distinct script" is a recipe for disaster.
>
> It junks data in the current encoding. How else is it a recipe for
> disaster?

/head_desk

There seems to be a lot of reality denial going on, presuming,
apparently, that if only TUNE were encoded, the old data and
the old encoding (i.e. what is currently in the standard) will
go away. It won't.

You would only end up with two encodings of the same script,
whether one is in PUA (which would be essentially useless, as
others have been pointing out) or not. And those two encodings
*would* coexist. And the complexity that results doesn't scale
linearly. In an attempt to make Tamil *simpler*, this proposal
is heading towards a disaster where it makes Tamil ineluctably
*more* complex in the encoding. *Much* more complex.

I'll say it again: Korean Hangul.

Korean *should* be simple and straightforward.

It isn't.

Why? Because it wasn't encoded once in the standard -- it was
encoded *FOUR* times.

Doubt me? Examine the standard:

Encoding #1: U+1100..U+11F9, as combining jamos

Encoding #2: U+AC00..U+D7A3, as preformed syllables

Encoding #3: U+3131..U+318E, as compatibility jamos

Encoding #4: U+FFA0..U+FFDC, as halfwidth jamos

Representing the *same* Korean text is done distinctly for each
of those encodings.
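
A minimal sketch of the point, assuming Python's standard unicodedata
module and taking the syllable U+D55C as an arbitrary example (the
variable names are only illustrative):

```python
import unicodedata

# Encoding #2: one preformed syllable, U+D55C.
precomposed = "\uD55C"
# Encoding #1: combining jamos (leading HIEUH, vowel A, trailing NIEUN).
combining = "\u1112\u1161\u11AB"
# Encoding #3: compatibility jamos HIEUH, A, NIEUN.
compat = "\u314E\u314F\u3134"
# Encoding #4: halfwidth jamos HIEUH, A, NIEUN.
halfwidth = "\uFFBE\uFFC2\uFFA4"

# Four different code point sequences for the same text.
distinct = len({precomposed, combining, compat, halfwidth})

# #1 and #2 are canonically equivalent, so NFC unifies them.
canon_ok = unicodedata.normalize("NFC", combining) == precomposed

# #3 and #4 only carry compatibility mappings, and the final NIEUN maps
# to a *leading* consonant, so NFKC does not rebuild the syllable.
nfkc_compat = unicodedata.normalize("NFKC", compat)
nfkc_half = unicodedata.normalize("NFKC", halfwidth)
compat_ok = nfkc_compat == precomposed

print(distinct, canon_ok, compat_ok)
print([hex(ord(c)) for c in nfkc_compat])
```

Normalization alone takes care of #1 versus #2; data from #3 or #4
needs something that knows which jamo is in which syllable position.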

And hey, sorting Encoding #2 is easy, because all the syllables
are laid out in the collation order, so binary works just fine.
Sound familiar?

But sorting *Korean* in Unicode is a bloody, awful nightmare
with edge cases galore, because the encoding is such a mess
to begin with. If you are dealing with any data originating
from encoding #3 or #4, you have to put in place transducers
to convert representation, or get only partially correct
results. And even for encodings #1 and #2, which are meant to
work with each other and which have canonical equivalence relations
built in, you *still* have funky edge cases because the combining
jamos are more expressive than the preformed syllables
(which don't cover ancient Hangul), and you can't depend just on
the binary order of the preformed syllables -- which was one
of the big reasons for creating them in the first place.
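
A small sketch of how binary order breaks once encodings #1 and #2
are mixed, again assuming Python's unicodedata module; the two
syllables are arbitrary examples, and plain code point comparison
stands in for a "binary" sort:

```python
import unicodedata

ga = "\uAC00"  # preformed syllable GA
na = "\uB098"  # preformed syllable NA, which should sort after GA

# The same NA, stored as combining jamos (U+1102 U+1161).
na_jamo = unicodedata.normalize("NFD", na)

# Code point order: the jamo sequence starts at U+1102, so NA lands
# in front of GA.
binary_sorted = sorted([na_jamo, ga])

# Normalizing to NFC before comparing restores the expected order
# for modern syllables.
normalized_sorted = sorted(
    [na_jamo, ga],
    key=lambda s: unicodedata.normalize("NFC", s),
)

print(binary_sorted == [na_jamo, ga])
print(normalized_sorted == [ga, na_jamo])
```

This only repairs the modern-syllable case. Ancient Hangul sequences
have no preformed syllable to compose to, so they stay as jamos and
still need a real collation table.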

Making encodings *more* complex does not make them simpler to
process.

Adding a *second* encoding for Tamil, no matter that it be
divinely inspired and self-evident, does *NOT* make Unicode
processing of Tamil data simpler.

> > you have to make the software
> > *aware* of the Tamil script to establish the equivalences between the
> > existing Tamil encoding and the TUNE encoding.
>
> Are such canonical equivalences now permitted?

If the claim were to be for identity of interpretation, as for
combining jamos versus an equivalent preformed Hangul syllable,
then you'd be committing yourself to canonical equivalences.
As far as I can tell, there is nothing in the TUNE table that
cannot already be represented with the existing Tamil characters.

But if you commit to introduction of characters with canonical
equivalences, you might as well give up right there. Such
additions accomplish nothing except force everyone to do the
canonical mappings to normalize the data. And it wouldn't
normalize *to* the TUNE representations, but away from them.
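
An existing character that already behaves this way can serve as an
illustration. The sketch below assumes Python's unicodedata module and
uses U+0958 DEVANAGARI LETTER QA, a precomposed letter that is
excluded from composition; it is an analogy only, not a statement
about how any particular TUNE character would be assigned:

```python
import unicodedata

qa = "\u0958"  # DEVANAGARI LETTER QA, precomposed

# Canonical decomposition: KA (U+0915) followed by NUKTA (U+093C).
nfd = unicodedata.normalize("NFD", qa)

# NFC does not put the precomposed letter back together, because
# U+0958 is on the composition exclusion list.
nfc = unicodedata.normalize("NFC", qa)

print([hex(ord(c)) for c in nfd])
print([hex(ord(c)) for c in nfc])
print(nfc == qa)
```

Every normalization form leads away from the precomposed character,
so normalized data never contains it.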

> I suppose they could be made
> equivalent in the default Unicode collation algorithm.

Yes, if you didn't claim actual interpretive equivalence, but
simply a compatibility equivalence, then you could import the
complexity of the mapping into the collation algorithm. But
any process that was not using a full-blown collation tailoring
for Tamil, but expected normalization to do the equivalencing,
would end up with the wrong answers.

> Another, nasty
> issue, is that if they were canonically equivalent, conversion from TUNE
> characters to NFD (thus current Tamil) would make text dependent on
> sophisticated rendering, and defeat a large part of the point of TUNE.

Precisely.

Or more correctly, it would defeat the entire point of TUNE.

And more -- because the resulting encoding would be more complex
than if TUNE had never been considered in the first place.

> > Encoding TUNE, whether in the PUA or elsewhere, *without any
> > awareness of Tamil as a distinct script*, defeats the purpose
> > of an encoding in the first place.
>
> Please enlighten me. What's fundamentally wrong with having LATIN LETTER
> TAMIL K, LATIN LETTER TAMIL KA, etc?

Huh? Other than the fact that TAMIL KA isn't a Latin letter?

I suppose we could have CJK IDEOGRAPH TAMIL K, too, for that matter,
but I don't see how that helps any. :-)

What I was responding to was your claim (or perhaps your
interpretation of the implicit claim behind the TUNE proposal)
that New Tamil could be rendered in a dumb way "without any
awareness of Tamil as a distinct script". That is of course
true at a certain level, particularly if you consider the
issue *only* for New Tamil, as if this were simply another
8-bit font hack solution. It *isn't* true once you try to make
New Tamil work *in* Unicode -- at that point the fact that these
are another representation of Tamil characters becomes critical
to proper behavior of everything, and you *CANNOT* treat the
encoding as if it were a de novo simple script. It isn't: as
proposed it is a *RE*-encoding of an existing encoded script
with complex behavior. That is the difference.

> I thought scripts were chiefly
> relevant in Unicode because characters in the same script tend to have
> similar properties and have to work together.

They do have to work together, but not because they have "similar
properties". Characters in a script often have very distinct
properties -- e.g. base characters versus combining marks.

Scripts are chiefly relevant because they delimit the
identity of characters and because in implementations they
trigger distinct rendering logic and font choices.

--Ken

