Re: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

Date: Sat Oct 27 2007 - 19:53:57 CDT

    Dear Ed,

    Rather than a 'competing' system, this could in fact be a complementary system.

    Such ideas are being considered for at least two research projects
    into unencoded characters that I know of. In fact, one might say,
    Wenlin's CDL is an example of such a system.


    Quoting Ed Trager <>:

    > Hi, everyone,
    > Although a component-based system of encoding Han ideographs clearly
    > did not happen --and is not going to happen-- in the Unicode Standard,
    > there is no reason why such a system and standard could not be now
    > devised --along with reference implementations-- by an enterprising
    > community of people worldwide interested in creating a new, possibly
    > competing, and certainly less-limiting future standard for the
    > encoding of textual information using Han ideographs.
    > One can rather easily imagine an Open Source-style project which would
    > set out to define a new and independent standard for encoding Han
    > ideographs based on their components and the relative positioning of
    > those components.
    > Any ideographs so encoded which map to ideographs currently encoded in
    > Unicode could simply be rendered using existing Unicode CJK fonts
    > which already contain the relevant "precomposed" glyphs.
    > As for those ideographs not yet encoded in Unicode, or those rare
    > historical or modern oddities and variants which will never be encoded
    > in Unicode, such a system would need to provide a "composing engine"
    > capable of doing at least a half-decent job at composing ideographs
    > from the set of base components. Writing such an engine would be a
    > great challenge, which might make it even more likely to actually
    > happen, as smart people everywhere on the planet generally enjoy a
    > good challenge :-) .
    > Such a "composing engine" could eventually be tied into existing or
    > future text layout and font rasterizing engines, thus allowing
    > noodle-eaters everywhere to be able to write about how tasty that dish
    > of "biang2 biang2" noodles* they had yesterday was, or parents to name
    > their cute babies using uniquely cute ideographs invented by
    > themselves, or enterprising marketeers to gain marketshare by
    > inventing new ideographs for their "As Seen On TV" products.
    > Of course there would be many important real-world and scholarly
    > applications if such a standard and system existed too. :-)
    > (* )
    > -- Ed Trager
    >> On Oct 25, 2007, at 11:41 PM, wrote:
    >> An even more efficient solution, as far as code points go, would have
    >> been to encode the components of Chinese characters, not precomposed
    >> characters. This would take up over 10 thousand code points to encode
    >> the current 70 thousand Unicode characters, and include over 80% of
    >> all CJKV submissions. In this case new submissions would be
    >> restricted to new components. This way all CJKV would be in the BMP.
    > On 10/27/07, <> wrote:
    >> Dear Gerrit,
    >> IMHO you are correct, the biggest obstacle was not technical, but
    >> other factors.
    >> John
    >> Quoting Gerrit Sangel <>:
    >> > Excuse me if I am wrong, but according to Wikipedia, the original Cangjie
    >> > method mastered this in the 80s or so. And I do not think the computers at
    >> > that time were really sophisticated.
    >> >
    >> > Could it not have been solved like the ligatures in TeX? I mean, TeX masters
    >> > some features other apps still cannot do now.
    >> >
    >> > I think, a possibility would have been to store the text like 女 (U+5973)
    >> > and 馬 (U+99AC) and generate 媽 (U+5ABD) via some kind of ligatures. This
    >> > could then be stored in the font, which describes that if 女 is followed
    >> > by 馬 and a character for "next character" it should generate 媽.
    >> >
    >> > This could have then spanned the ordinary CJK range, but if some kind
    >> > of "unknown" character is typed in, it could still be stored (maybe in
    >> > a more inferior quality in display, but still it would not have needed
    >> > a code point).
    >> >
    >> > Regards
    >> > Gerrit Sangel
    >> >
    >> > Am Freitag 26 Oktober 2007 schrieb John H. Jenkins:
    >> >> it would
    >> >> have required technical support beyond the abilities of then-current
    >> >> systems, it would have made East Asian texts take even *more* space
    >> >> than they do now and made them more difficult to process.
    >> >
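
As a rough illustration of the ligature-style lookup Gerrit describes in the quoted message above, and of the fallback Ed mentions for ideographs that have no precomposed glyph, here is a minimal Python sketch. The table `PRECOMPOSED`, the function `compose`, and the use of the Ideographic Description Character U+2FF0 to mark left-to-right placement are assumptions made for illustration only; none of them comes from the thread, and a real composing engine would need far more than a table lookup.

```python
# Minimal sketch, not a real composing engine.
# Assumption: layout is expressed with Unicode Ideographic Description
# Characters; only the left-to-right operator is used here.
LEFT_RIGHT = "\u2ff0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

# Hypothetical toy table: (layout, components...) -> precomposed character.
PRECOMPOSED = {
    (LEFT_RIGHT, "\u5973", "\u99ac"): "\u5abd",  # U+5973 + U+99AC -> U+5ABD
}


def compose(layout, *components):
    """Return (text, is_precomposed) for a component sequence.

    If the sequence maps to an already encoded ideograph, return that
    character so an existing CJK font can render it. Otherwise return the
    raw description sequence, which a composing engine would have to draw.
    """
    key = (layout, *components)
    if key in PRECOMPOSED:
        return PRECOMPOSED[key], True
    return layout + "".join(components), False


if __name__ == "__main__":
    print(compose(LEFT_RIGHT, "\u5973", "\u99ac"))
    print(compose(LEFT_RIGHT, "\u99ac", "\u5973"))
```

The second call uses a sequence that is not in this toy table, so the function hands back the description sequence unchanged instead of a single code point.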
