Re: Level of Unicode support required for various languages

From: Eric Muller ([email protected])
Date: Thu Oct 25 2007 - 13:19:10 CDT

Next message: Peter Constable: "RE: Level of Unicode support required for various languages"

Previous message: Andrew West: "Re: Level of Unicode support required for various languages"
In reply to: Timothy Armes: "RE: Level of Unicode support required for various languages"
Next in thread: António Martins-Tuválkin: "Re: Level of Unicode support required for various languages"

Timothy Armes wrote:
>
> Here is a current list. Can all of these be written without combining marks and variant glyphs?
>

For Thai, the answer is definitely no. Thai uses combining marks for
(some) vowel signs and for tone marks, which do occur in everyday texts.
There are no code points for any of the combinations.
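As a rough illustration of the Thai case (a sketch using Python's standard `unicodedata` module; the sample syllable and the helper name `describe` are chosen here for illustration and are not from the original question), normalizing to NFC leaves the marks as separate code points, because no precomposed forms exist to compose into:

```python
import unicodedata

# Illustrative sample: KO KAI + vowel sign SARA I + tone mark MAI THO.
text = "\u0e01\u0e34\u0e49"

def describe(s):
    """Return (code point, general category, combining class) per character."""
    return [
        (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.combining(c))
        for c in s
    ]

# NFC composes wherever a precomposed character exists; for Thai none does,
# so the string comes back unchanged and still has three code points.
composed = unicodedata.normalize("NFC", text)

for entry in describe(composed):
    print(entry)
```

A layout engine that draws one glyph per code point, side by side, would therefore misplace the vowel sign and the tone mark, whatever normalization is applied first.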

For Vietnamese, the answer is yes, but there is a trap. All the
combinations of letters and marks exist as precomposed characters, hence
the "yes". However, it is common that the representation of text does
not use those precomposed characters (e.g. the Windows keyboard for
Vietnamese generates combining characters). The same trap can exist with
many of the European languages; much depends on how the text was generated.
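One way around the trap, sketched here with Python's standard `unicodedata` module, is to normalize incoming text to NFC before it reaches the layout code. The function name `to_precomposed` and the sample strings are illustrative assumptions; the mixed sequence is only meant to resemble what a keyboard or legacy converter might produce.

```python
import unicodedata

def to_precomposed(text):
    """Normalize to NFC, which composes base letters and marks where possible."""
    return unicodedata.normalize("NFC", text)

# Three spellings of the same word, "Viet" with circumflex and dot below on e.
decomposed = "Vie\u0302\u0323t"   # e + combining circumflex + combining dot below
mixed = "Vi\u00ea\u0323t"         # precomposed e-circumflex + combining dot below
precomposed = "Vi\u1ec7t"         # single code point U+1EC7

print(to_precomposed(decomposed) == precomposed)
print(to_precomposed(mixed) == precomposed)
print(len(to_precomposed(decomposed)))
```

This only helps for languages where every needed combination has a precomposed character; it does nothing for Thai.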

For the languages using the Arabic script (Arabic, Farsi/Persian), there
is a somewhat similar situation: while the generally preferred path is
to use the characters in the U+06xx block, and to have the layout engine
select the appropriate positional form ("variant glyphs" in your
terminology), a fair amount of text can instead be represented using
the presentation forms in the U+Fxxx blocks (Arabic Presentation Forms-A
and -B), where there is a simple mapping from the characters to the
glyphs.
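The relationship between the two representations can be inspected with Python's standard `unicodedata` module. The sketch below is only an illustration: the helper name `to_nominal` is an assumption, and NFKC is used as a convenient way to fold presentation forms back to the nominal letters (it is a lossy normalization in general, so it is not a recommendation for every pipeline).

```python
import unicodedata

def to_nominal(text):
    """Fold Arabic presentation forms back to the nominal U+06xx letters."""
    return unicodedata.normalize("NFKC", text)

# The four presentation forms of BEH: isolated, final, initial, medial.
beh_forms = "\ufe8f\ufe90\ufe91\ufe92"

# Each presentation form carries a compatibility decomposition whose tag
# names the positional form that a layout engine would otherwise select.
tags = [unicodedata.decomposition(c).split()[0] for c in beh_forms]
print(tags)

# All four fold back to the single nominal letter U+0628.
print([f"U+{ord(c):04X}" for c in to_nominal(beh_forms)])
```

Going the other way, from nominal letters to positional forms, is the joining logic that a layout engine has to implement; it is not something normalization provides.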

There is an additional complication for Hebrew and Arabic, because those
scripts are inherently bidirectional, and I doubt that even with your
restricted domain, you can escape that (do you have texts that include
letters and numbers?). If you were to impose that your input be in
visual order, then you would have a hard time claiming Unicode support.
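To get a feel for how often this arises in real data, a small check along the following lines can flag strings that mix directions. It is a sketch, not an implementation of the Unicode Bidirectional Algorithm: it only reads the bidirectional class of each character with Python's standard `unicodedata` module, and the function name `needs_bidi` is an assumption made for illustration.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}            # Hebrew letters, Arabic letters
LTR_OR_NUMBER = {"L", "EN", "AN"}    # Latin letters, European and Arabic digits

def needs_bidi(text):
    """True if text mixes right-to-left letters with left-to-right text or digits."""
    classes = {unicodedata.bidirectional(c) for c in text}
    return bool(classes & RTL_CLASSES) and bool(classes & LTR_OR_NUMBER)

print(needs_bidi("\u05e9\u05dc\u05d5\u05dd 123"))   # Hebrew word followed by digits
print(needs_bidi("\u05e9\u05dc\u05d5\u05dd"))       # Hebrew word alone
print(needs_bidi("abc 123"))                        # Latin text only
```

Any string for which this returns true has to be reordered for display, which a strictly one-character-after-the-other engine cannot do.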

Stepping back a bit, I suppose that you currently have a simple layout
engine that gets some non-Unicode text and assumes that the rendering of
a string is simply the rendering of each "character" one after the
other, and that your immediate task is to take Unicode instead. Your
question is probably "how much work do I need to do on my layout
engine?" It's also probably the case that you deal with relatively
simple texts, and that you deal (today) with a fixed set of languages.

There are many factors that would influence the answer, which only you
can really evaluate: the sources of your text data, how many constraints
you can put on those sources, the font technology you use, how much
hardware budget you have, how much you can control which fonts are used,
and so on.

One temptation is to stick to your layout engine and put some
constraints on the incoming data. For example, you could get by for
Vietnamese by imposing that the input data uses only the precomposed
characters. That would be a compliant Unicode implementation (there is
no requirement to support the full repertoire). Similarly, you could
impose that Arabic texts use the presentation forms, and that would also
be compliant. However, all these constraints fundamentally amount to
using Unicode not as a text encoding system but as a glyph repertoire
(that's what a *simple* layout engine really implies). While knowingly
maintaining the confusion between text encoding and glyph repertoire can
work to some extent, it is very tricky to do so. First, you would
really need to understand the gap between the two (which, not
surprisingly, would amount to being able to specify a non-simple layout
engine), and second, you would need to bridge that gap by other means
than a piece of software. It is also my guess that your current
situation (i.e. set of languages and texts you want to support) is right
at the edge of what can be done using this approach (in fact, proper
bidi support is probably on the "wrong" side of the edge). In short,
that path seems attractive because it minimizes the immediate
engineering work (either your own implementation or the acquisition and
deployment of somebody else's implementation), but it is much harder to
make it work in practice, and it has large hidden costs.
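If that path is taken anyway, the constraints on incoming data at least need to be checked mechanically. The following is a minimal sketch of such a gate, assuming Python's standard `unicodedata` module; the function name `simple_engine_problems` and the two rules it applies are illustrative assumptions, and they cover only part of the gap described above.

```python
import unicodedata

def simple_engine_problems(text):
    """List characters that a one-glyph-per-code-point engine cannot handle.

    The text is normalized to NFC first, so marks that have a precomposed
    form (as in Vietnamese) disappear and only the rest are reported.
    """
    problems = []
    for c in unicodedata.normalize("NFC", text):
        label = f"U+{ord(c):04X}"
        if unicodedata.category(c) in ("Mn", "Mc", "Me"):
            problems.append((label, "combining mark"))
        elif unicodedata.bidirectional(c) in ("R", "AL"):
            problems.append((label, "right-to-left"))
    return problems

print(simple_engine_problems("Vie\u0302\u0323t"))   # Vietnamese, decomposed
print(simple_engine_problems("\u0e01\u0e34"))       # Thai consonant + vowel sign
print(simple_engine_problems("\ufe91"))             # Arabic presentation form
```

Note that the Arabic presentation form is still reported: it removes the need for glyph selection, but not the need for bidirectional reordering.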

Eric.

This archive was generated by hypermail 2.1.5 : Thu Oct 25 2007 - 13:22:05 CDT