Re: Controls, gliphs, flies, lemonade

From: John H. Jenkins <>
Date: Tue, 20 Sep 2011 22:07:04 -0600

In re CJK, that's already a FAQ. The short version: if all you want to do is draw something, then yes, making up new hanzi on the fly is a solvable problem. If you want to do anything that deals with the *content* (lexical analysis, sorting, text-to-speech), it's an incredibly difficult problem.

And, actually, there's already a way to insert nonstandard hanzi into text (well, two, if you count the Ideographic Variation Indicator), namely Ideographic Description Sequences. They're clumsy and awkward, but they do make it possible to exchange text with unencoded hanzi in a vaguely standard fashion.
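
To make the structure of an Ideographic Description Sequence concrete, here is a minimal Python sketch that checks whether a string is a well-formed IDS. It assumes only the twelve Ideographic Description Characters originally encoded at U+2FF0..U+2FFB (two of them, U+2FF2 and U+2FF3, take three operands; the rest take two) and treats every other character as a component. The names parse_ids and is_well_formed_ids are made up for this illustration and are not part of any library.

```python
# Ideographic Description Characters (IDCs) at U+2FF0..U+2FFB.
# Assumption: IDCs added to Unicode later are ignored in this sketch.
TERNARY_IDCS = {"\u2FF2", "\u2FF3"}
BINARY_IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY_IDCS


def parse_ids(text, pos=0):
    """Parse one IDS starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("unexpected end of sequence")
    ch = text[pos]
    if ch in BINARY_IDCS or ch in TERNARY_IDCS:
        arity = 3 if ch in TERNARY_IDCS else 2
        operands = []
        pos += 1
        for _ in range(arity):
            operand, pos = parse_ids(text, pos)
            operands.append(operand)
        return (ch, operands), pos
    # Anything else is treated as a component (an ideograph or radical).
    return ch, pos + 1


def is_well_formed_ids(text):
    """True if text is exactly one complete description sequence."""
    try:
        _, end = parse_ids(text)
    except ValueError:
        return False
    return end == len(text)


# U+2FF0 (left to right) followed by two copies of U+6728 (tree).
two_trees = "\u2FF0\u6728\u6728"
# U+2FF1 (above to below): U+8279 on top of the sequence above.
nested = "\u2FF1\u8279" + two_trees

print(is_well_formed_ids(two_trees))       # complete sequence
print(is_well_formed_ids(nested))          # nested sequence
print(is_well_formed_ids("\u2FF0\u6728"))  # missing the second operand
```

A check like this only covers the drawing side of the problem: it says nothing about how to sort, search, or pronounce the described hanzi, which is where the real difficulty lies.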

And yes, Unicode is very complicated, but that's because of the problem it's intended to solve. If all you're interested in is drawing text in a couple of common scripts, such as Latin and Japanese, then you really don't need Unicode with all of its complexity. Unicode is trying to provide a basis for handling all aspects of plain text processing for all the languages of the world in a single application.

Just go to Wikipedia and look down the long list of different languages that a popular subject has articles in. *That* is what Unicode is trying to provide. It's very tough to implement, but fortunately on all the major platforms, there are libraries that make it unnecessary for you to do all the work yourself.
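
As one small illustration of a library doing the work, here is a sketch using Python's standard unicodedata module to deal with diacritics that are not in canonical order. The two strings differ in the order of their combining marks, but normalization puts the marks into canonical order, so the strings compare equal afterwards. The helper canonically_equivalent is a name invented for this example.

```python
import unicodedata

# "a" + COMBINING ACUTE ACCENT (U+0301) + COMBINING DOT BELOW (U+0323)
acute_first = "a\u0301\u0323"
# The same marks typed in the opposite order.
dot_first = "a\u0323\u0301"


def canonically_equivalent(s, t):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# The raw code point sequences differ...
print(acute_first == dot_first)
# ...but the library reorders the marks by canonical combining class.
print(canonically_equivalent(acute_first, dot_first))
# The combining classes that drive the reordering.
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0323"))
```

The application never has to know the combining class of each mark; it only has to normalize before comparing.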

On 20 Sep 2011, at 9:01 PM, QSJN 4 UKR wrote:

> Yes, I wrote 'Egyptian hieroglyphs', but how about plain CJK? We
> still have no way to insert a nonstandard ideogram into text. Isn't it
> a simple task? There are just 20 basic strokes :) ok, 500 basic
> symbols. Or 200000? However, we can't combine them together :( !
> Unicode is too complex a standard. I don't even know how many properties
> one character has (did you know about Unicode-coloured characters? -
> that was my topic somewhere on this list), so how can I know how my
> application has to render 'plain' text with bidi, diacritics not in
> canonical order, and Korean script? Right, I don't know that. And my
> application renders it one way, someone else's another (a_a / aa_ -
> double combining characters, surely you have seen that), so we have no
> standard at all.
> Of course, I can learn this complex standard, but what for? Most of
> it I will never use.
> There must be a simpler system that doesn't need so much a priori data to work.
> 2011/9/13, John H. Jenkins <>:
>> On 12 Sep 2011, at 9:06 PM, QSJN 4 UKR wrote:
>>> I know it is a sacred cow, but let me just ask what you people think.
>>> Is it good or bad that the codepoint says everything about a character:
>>> what, where, how... (see subject)? Maybe if we had separate graphic and
>>> control codes we wouldn't have so many problems, from banal ltr (( rtl
>>> instead ltr (rtl) to placing one tilde above 3, 4, or more letters, or
>>> Egyptian hieroglyphs in rows and columns. Conceptually, I mean! Each letter
>>> in the text would be at least two codepoints ("what" and "where") in the
>>> file. Is that stupid? To render the text we must generate this data anyway.
>> It's not really a sacred cow per se, but it is a fundamental architectural
>> decision which would be pretty much impossible to revisit now.
>> Almost all writing is done using a small set of script-specific rules which
>> are pretty straightforward. English, for example, is laid out in horizontal
>> lines running left-to-right and arranged top-to-bottom of the writing
>> surface. East Asian languages were traditionally laid out in vertical lines
>> running from top-to-bottom and arranged right-to-left on the writing
>> surface.
>> Because some scripts are right-to-left and ltr and rtl text can be freely
>> intermingled on a single line, Unicode provides plain-text directionality
>> controls. The preference, however, is to use higher-level protocols where
>> possible.
>> As for the scripts which are inherently two-dimensional (such as
>> hieroglyphics, mathematics, and music), it's almost impossible to provide
>> "plain text" support for them. There is too much dependence on additional
>> information such as the specifics of font and point size. Because of this,
>> the UTC decided long ago that layout for such scripts absolutely must be
>> done using a higher-level protocol to handle all the details.
>> There are occasionally suggestions that positioning controls be added to
>> plain text in Unicode, but so far the UTC has felt that the benefits are too
>> marginal to overcome its reasons for having left them out in the first
>> place.
>> =====
>> Hoani H. Tinikini
>> John H. Jenkins

John H. Jenkins
Received on Tue Sep 20 2011 - 23:11:26 CDT
