Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: Ed Trager ([email protected])
Date: Sat Oct 27 2007 - 11:28:06 CDT

Next message: Michael Maxwell: "RE: thorn vs. y or th, eth and other similar letters/signs"

Previous message: Philippe Verdy: "thorn vs. y or th, eth and other similar letters/signs (was: Level of Unicode support required for various languages)"
Next in thread: [email protected]: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Reply: [email protected]: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Reply: John H. Jenkins: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Reply: Jeroen Ruigrok van der Werven: "Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"

Hi, everyone,

Although a component-based system of encoding Han ideographs clearly
did not happen (and is not going to happen) in the Unicode Standard,
there is no reason why such a system and standard, along with
reference implementations, could not now be devised by an enterprising
worldwide community of people interested in creating a new, possibly
competing, and certainly less limiting future standard for the
encoding of textual information using Han ideographs.

One can rather easily imagine an Open Source-style project which would
set out to define a new and independent standard for encoding Han
ideographs based on their components and the relative positioning of
those components.

Any ideographs so encoded which map to ideographs currently encoded in
Unicode could simply be rendered using existing Unicode CJK fonts
which already contain the relevant "precomposed" glyphs.
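
To make the fallback idea concrete, here is a minimal sketch in Python.
Everything in it is hypothetical: the function name, the tiny lookup
table, and especially the choice of Unicode's Ideographic Description
Character U+2FF0 (left to right) as the positioning operator, which is
my assumption for illustration and not a proposal for the actual
encoding.

```python
# Hypothetical sketch: the names and the tiny table are illustrative only.
LEFT_RIGHT = "\u2FF0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

# Component sequence -> precomposed Unicode character.
# A real table would cover every ideograph already encoded in Unicode.
PRECOMPOSED = {
    (LEFT_RIGHT, "\u5973", "\u99AC"): "\u5ABD",  # U+5973 + U+99AC -> U+5ABD
    (LEFT_RIGHT, "\u5973", "\u5B50"): "\u597D",  # U+5973 + U+5B50 -> U+597D
}

def render(sequence):
    """Return ("precomposed", char) when Unicode already has the ideograph,
    otherwise ("compose", sequence) so a composing engine can take over."""
    key = tuple(sequence)
    if key in PRECOMPOSED:
        return ("precomposed", PRECOMPOSED[key])
    return ("compose", key)
```

A sequence found in the table is drawn with the existing font glyph;
anything else is passed along unchanged for composition.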

As for those ideographs not yet encoded in Unicode, or those rare
historical or modern oddities and variants which will never be encoded
in Unicode, such a system would need to provide a "composing engine"
capable of doing at least a half-decent job at composing ideographs
from the set of base components. Writing such an engine would be a
great challenge, which might make it even more likely to actually
happen, as smart people everywhere on the planet generally enjoy a
good challenge :-) .
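
As a very rough illustration of the first step such an engine would
take, the sketch below assigns each component a box inside a unit
square. The nested-tuple input format, the function name, and the naive
rule of splitting every box into equal halves are all assumptions of
mine; a half-decent engine would need per-component proportions, stroke
weight adjustment, and many more positioning operators than the two
shown here.

```python
# Hypothetical sketch: equal-halves layout is a deliberately naive assumption.
LEFT_RIGHT = "\u2FF0"   # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
ABOVE_BELOW = "\u2FF1"  # IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW

def layout(node, box=(0.0, 0.0, 1.0, 1.0)):
    """Assign a bounding box to every component of a composition tree.

    node is either a component (a string) or a tuple
    (operator, first, second). box is (x, y, width, height).
    Returns a list of (component, box) pairs in reading order.
    """
    if isinstance(node, str):
        return [(node, box)]
    op, first, second = node
    x, y, w, h = box
    if op == LEFT_RIGHT:
        # Split the box into a left half and a right half.
        return (layout(first, (x, y, w / 2, h))
                + layout(second, (x + w / 2, y, w / 2, h)))
    if op == ABOVE_BELOW:
        # Split the box into a top half and a bottom half.
        return (layout(first, (x, y, w, h / 2))
                + layout(second, (x, y + h / 2, w, h / 2)))
    raise ValueError("unsupported positioning operator: %r" % (op,))
```

Each resulting box would then be handed to the rasterizer, which scales
the component's outline to fit.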

Such a "composing engine" could eventually be tied into existing or
future text layout and font rasterizing engines, thus allowing
noodle-eaters everywhere to be able to write about how tasty that dish
of "biang2 biang2" noodles* they had yesterday was, or parents to name
their cute babies using uniquely cute ideographs invented by
themselves, or enterprising marketeers to gain market share by
inventing new ideographs for their "As Seen On TV" products.

Of course there would be many important real-world and scholarly
applications if such a standard and system existed too. :-)

(* http://en.wikipedia.org/wiki/Biang_biang_noodles )

-- Ed Trager

> On Oct 25, 2007, at 11:41 PM, [email protected] wrote:
>
> An even more efficient solution, as far as code points go, would have
> been to encode the components of Chinese characters, not precomposed
> characters. This would take up over 10 thousand code points to encode
> the current 70 thousand Unicode characters, and include over 80% of
> all CJKV submissions. In this case new submissions would be
> restricted to new components. This way all CJKV would be in the BMP.
>

On 10/27/07, [email protected] <[email protected]> wrote:
> Dear Gerrit,
>
> IMHO you are correct; the biggest obstacle was not technical, but
> other factors.
>
> John
>
> Quoting Gerrit Sangel <[email protected]>:
>
> > Excuse me if I am wrong, but according to Wikipedia, the original Cangjie
> > method mastered this in the 80s or so. And I do not think the computers of
> > that time were really sophisticated.
> >
> > Could it not have been solved like the ligatures in TeX? I mean, TeX masters
> > some features other apps still cannot do now.
> >
> > I think, a possibility would have been to store the text like 女 (U+5973)
> > and 馬 (U+99AC) and generate 媽 (U+5ABD) via some kind of ligatures. This
> > could then be stored in the font, which describes that if 女 is followed
> > by 馬 and a character for "next character" it should generate 媽.
> >
> > This could then have spanned the ordinary CJK range, but if some kind
> > of "unknown" character is typed in, it could still be stored (maybe with
> > inferior display quality, but still it would not have needed a code
> > point).
> >
> > Regards
> > Gerrit Sangel
> >
> > Am Freitag 26 Oktober 2007 schrieb John H. Jenkins:
> >> it would
> >> have required technical support beyond the abilities of then-current
> >> systems, it would have made East Asian texts take even *more* space
> >> than they do now and made them more difficult to process.
> >


This archive was generated by hypermail 2.1.5 : Sat Oct 27 2007 - 11:29:59 CDT