Re: Level of Unicode support required for various languages

From: John H. Jenkins
Date: Thu Oct 25 2007 - 23:06:27 CDT

    On Oct 25, 2007, at 8:14 PM, David Starner wrote:

    > On 10/25/07, <> wrote:
    >> I am aware that the original aim of Unicode was to have all 'useful'
    >> characters in the BMP. However, as far as CJKV characters are
    >> concerned, this has not been done; rather, characters have been added
    >> on a first come, first serve basis.
    > The character set standards of China, Taiwan, Japan and Korea were
    > completely included in the BMP. The sets of characters that computer
    > users of CJKV characters were actually using are all in the BMP. That
    > was not a first come, first serve policy.

    Perhaps we have a different sense of what "first come, first serve"
    means. To me, the fact that the PRC, Taiwan, Japan, and South Korea
    already had well-established and widely used character set standards
    means that their immediate needs got covered first. Vietnam and North
    Korea didn't have their character sets even under way at the
    time, so naturally their needs came later. There was not a general
    survey of "useful" CJKV characters (if that term even means anything)
    made before doing additions. If there had been, then nothing in
    IICore would be in plane 2.

    >> If the allocation of CJKV codepoints continues
    >> to be done in this way, then modern CJKV coverage will require not
    >> only BMP and plane 1 support but also, in the future, plane 3 support.
    > (Should be plane 2, BTW.)

    No, he meant plane 3. If the current explosion of extremely rare Han
    characters continues, we'll have to start putting them in plane 3
    before long.
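
    To make the plane numbering concrete: a code point's plane is its
    value divided by 0x10000, so the BMP is plane 0, Extension B starts
    plane 2 at U+20000, and plane 3 starts at U+30000. The following is
    only a minimal sketch; the function names are hypothetical and not
    part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane is the high bits.
    return code_point >> 16


def needs_supplementary_support(text: str) -> bool:
    """True if any character in the text lies outside the BMP (plane 0)."""
    return any(plane_of(ord(ch)) > 0 for ch in text)


if __name__ == "__main__":
    # Illustrative code points: a BMP ideograph, the first code point of
    # Extension B (plane 2), and the first code point of plane 3.
    for cp in (0x4E00, 0x20000, 0x30000):
        print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

    An implementation that assumes every character fits in 16 bits
    handles only the first of these three code points.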

    > If it continues to be done in what way? They currently have teams of
    > experts sorting through the body of writing in Han ideographs, finding
    > new distinct ideographs, and identifying what most needs encoding.
    > Short of God handing the next set of Han ideographs down from Mt.
    > Sinai on stone tablets, I don't know what improvements can be made.

    There is actually considerable room for improvement.

    First of all, the experience of Extension C showed that there was a
    serious QA problem in the IRG. The amount of effort involved in
    identifying unifiable pairs entirely by hand left the whole process
    error-prone. This has largely been corrected with Extension D work.

    Secondly, the whole issue of "distinct ideographs" is getting nastier
    and nastier as the IRG has to deal with increasingly rare characters
    of uncertain provenance and meaning. So long as the IRG continues to
    treat each "distinct" ideograph as something that needs independent
    encoding, this is going to be a problem that plagues us.

    If, for example, we'd had the concept of variation selectors as an
    established part of the standard during the Extension B work, the IRG
    could have saved literally thousands of code points which are now
    dedicated to obscure variants found in the Hanyu Da Zidian. If we
    abandon the idea that every distinct ideograph requires separate
    encoding, we could speed up the whole process, improve the quality of
    work, and -- most important -- make implementation much simpler.
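
    As a rough illustration of the variation-selector approach: a variant
    is written as the base ideograph followed by a selector, so software
    that ignores selectors still sees the base character. The sketch
    below assumes only the two selector ranges VS1-VS16 (U+FE00-U+FE0F)
    and VS17-VS256 (U+E0100-U+E01EF); the function names are hypothetical,
    and the sample sequence is illustrative only (whether it is actually
    registered is not checked).

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00-U+FE0F) and VS17-VS256 (U+E0100-U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text: str) -> list:
    """Pair each base character with the selector that follows it, if any.

    A selector with no preceding base (or following another selector) is
    kept as its own entry with no selector attached.
    """
    pairs = []
    for ch in text:
        if is_variation_selector(ch) and pairs and pairs[-1][1] is None:
            pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, None))
    return pairs


def strip_variation_selectors(text: str) -> str:
    """Fall back to the base characters by dropping every selector."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


if __name__ == "__main__":
    # Assumed sample: U+8FBB followed by VS17, then plain U+8FBB.
    sample = "\u8fbb\U000E0100\u8fbb"
    for base, selector in split_variation_sequences(sample):
        vs = f"U+{ord(selector):04X}" if selector else "none"
        print(f"base U+{ord(base):04X}, selector {vs}")
```

    Both entries share one base code point; only the optional selector
    distinguishes the variant.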

    John H. Jenkins
