Hangul syllable boundary and opentype fonts/rendering

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon Apr 07 2003 - 01:07:39 EDT


    Note to those on the Unicode list: we've been discussing OpenType support
    for the Korean script on the OpenType list, but there are some issues that
    can be better answered on the Unicode list, so I'm copying this
    to the Unicode list as well. Because the OpenType list archive is not
    available to the public, I put up two of my previous messages (which
    quote Paul's reply) at

      http://jshin.net/i18n/korean/ot.msg.1.txt
      http://jshin.net/i18n/korean/ot.msg.2.txt

    It'd be great if you can share your insights on the issue.

    Regards,

    Jungshik

    Paul Nelson (TYPOGRAPHY) wrote:

    Dear Paul,

    Thank you for your interest in the discussion.
    I wish I had contacted you much earlier than I did.

    >First, I need to begin by stating that Unicode is not a linguistic
    >encoding, nor is it supposed to be.

      It is certainly not for Hangul as it is now, which I regard as
    very unfortunate, but I'm not sure that it's not supposed to be. Numerous
    threads on the Unicode list about the encoding of Indic (and related)
    scripts of South Asia and Southeast Asia appear to indicate that the UTC
    and ISO JTC1/SC2/WG2 have been trying to make the encoding model
    for them in Unicode/ISO 10646 reflect the (linguistic) principles
    of those scripts as faithfully as possible (or whenever
    it makes sense).

    > With that in mind, it is important
    >to identify the fact that we are constrained to work within the bound of
    >the Unicode/ISO character encoding specifications.
    >
    >Here is data from the Unicode site:
    >
    >1100;HANGUL CHOSEONG KIYEOK;Lo;0;L;;;;;N;;g *;;;
    >1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L;;;;;N;;gg *;;;
    >

    I'm well aware of this, as I at least implied in my two previous messages, and I
    think requesting to make *all* Jamos in the U+1100 block atomic (regardless
    of whether they're clusters or not) was one of several blunders made
    by the South Korean standards body.

    JS> Not supporting composition of cluster/complex Jamos made up of
    JS> simple/basic Jamos (for instance, U+1101 is nothing but a 'presentation
    JS> form' of 'U+1100 U+1100') just because they're given separate codepoints
    JS> as presentation forms in Unicode is squarely against the principles of
    JS> Korean script as envisioned by its creators in the 15th century. [1].
    JS> Those complex/cluster Jamos got encoded (e.g. U+1133 =
    JS> U+1109 U+1107 U+1100) not because they're in any way superior to or more
    JS> fundamental than those NOT separately encoded (e.g. U+1105 U+1107
    JS> U+1107). They were just *lucky* to be spotted by Korean linguists
    JS> when the list was compiled and submitted to ISO/IEC JTC1/SC2/WG2 in
    JS> the early 1990s. [2]

      I wish the UTC had not been so eager to honor that request, especially
    considering that it's now impossible to mend this problem because
    the UTC committed itself NOT to modify the canonical composition/
    decomposition for any existing characters. However, this issue can
    be partly resolved/worked around by introducing tailored (canonical)
    (de)composition, which is on the table for the UTC if I understand it correctly.
    BTW, Kent Karlsson wrote a paper on the issue.
    (Kent, have you put your paper (draft) somewhere on the net? Could
    you give us the URL if you did?)

    >Because of this data, I cannot state that the form of U+1100 U+1100 will
    >result in U+1101. As we see above, the U+1101 has no decomposition form
    >specified. Thus, there is no need to make an engine to support this.
    >Additionally, I would argue that I cannot make an engine to support this
    >form as you suggest as I would have to violate Unicode properties to do
    >so.
    >

     Well, I'm afraid the UTC put Unicode somewhat 'in conflict with' itself (not
    exactly a conflict, but a point that needs to be made clearer)
    by NOT making Jamo clusters canonically equivalent to
    sequences of basic/simple Jamos.
    In section 3.11 of Unicode 3.2, a Hangul syllable is defined as

          S := (L+ V+ T* | L* S1 V* T* | L* S2 T*)
         where S1 is an LV-type syllable and S2 is an LVT-type syllable.

    Now, it's rather silent as to how sequences like 'U+1109 U+AC01'
     ( = U+1109 U+1100 U+1161 U+11A8) are supposed to be rendered. If
    'U+1109 U+1100' = U+112D, there'd be no issue at all. Unfortunately,
    U+1109 U+1100 is not canonically equivalent to U+112D and it never
    will be, because NFC/NFD are frozen. However, I think rendering
    engines/layout libraries like Uniscribe and OT fonts can take some
    liberty to interpret and best match what users intend when they come
    across 'U+1109 U+AC01'. I also believe that this is more in the
    spirit of Unicode 3.2 section 3.11 and UTR #29, according to which
    'U+1109 U+AC01' is regarded as forming a *single* grapheme (a syllable in
    this case) instead of two graphemes. In other words, 'U+1109 U+AC01'
    has to be treated and rendered as a unit. So, if it's followed by
    'U+302E' (Hangul Single Dot Tone Mark), U+302E has to be put to the
    left of the cluster 'U+1109 U+AC01' (=> U+1109 U+1100 U+1161 U+11A8 =>
    U+112D U+1161 U+11A8) instead of between U+1109 and U+AC01 (to
    the left of U+AC01).
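
    To make the expansion of U+AC01 concrete, here is a minimal sketch of
    the arithmetic decomposition of a precomposed Hangul syllable into
    conjoining Jamos, using the constants from the Unicode standard's
    Hangul syllable algorithm. The function names are only illustrative.

```python
# Constants of the Unicode Hangul syllable (de)composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT      # 11172 precomposed syllables


def decompose_syllable(cp):
    """Return the conjoining Jamo sequence for one precomposed syllable.

    Code points outside U+AC00..U+D7A3 are returned unchanged.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    # trail == T_BASE means "no trailing consonant" (an LV syllable).
    return [lead, vowel] if trail == T_BASE else [lead, vowel, trail]


def fmt(cps):
    """Format code points in the U+XXXX notation used in this message."""
    return " ".join(f"U+{cp:04X}" for cp in cps)


# The leading U+1109 stays as it is; only U+AC01 is expanded.
print(fmt([0x1109] + decompose_syllable(0xAC01)))
```

    This arithmetic only covers the precomposed syllables; it says nothing
    about cluster Jamos such as U+112D, which is exactly the gap discussed
    here.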

    >As you have pointed out, this makes Unicode not handle the Korean script
    >as it was envisioned by its creators in the 15th century. Unicode is not
    >specifically designed for the purpose of handling scripts in the way in
    >which they were designed to begin with, but to be able to correctly
    >represent text in a Uniform manner that allows for unambiguous exchange
    >of data.

      The way I think Hangul should have been encoded (closely
    matching the intent of its inventors) would have paved a much cleaner
    way for a uniform representation of the Korean script than the current
    Unicode does. Most, if not all, of the blame for this problem has
    to be taken not by the UTC but by my government, whose incompetence and
    short-sightedness stand in stark contrast with the foresight and competence
    of the Indian government, which came up with infinitely better encoding
    models for Indic scripts (in ISCII and Unicode/ISO 10646),
    which are similar to the Korean script in a number of respects. Anyway,
    we have to live with the reality and, IMHO, a possible way to
    work around it is to introduce tailored composition/decomposition
    that is optional on paper but is implicitly semi-official/required.
    [1]

    >In your example, one person might type U+1100 U+1100 while another types
    >U+1101. This would lead to confusion in the "correct" manner to
    >represent the encoding of the shape that looks like U+1101. By following
    >Unicode as it exists (with its imperfections) we have the ability to
    >support the open exchange of text and the digital recording of text that
    >we can preserve into the future.

      As you know very well, there are multiple "correct" ways to represent
    identical characters/letters in Unicode, and the way to solve
    problems arising from multiple representations is canonical
    composition/decomposition. Unfortunately, for Hangul, canonical
    composition of complex/cluster Jamos out of basic/simple Jamo
    sequences is missing, but I hope that the issue will be partly solved by
    introducing tailoring of composition/decomposition as mentioned above. In
    the meantime, what I suggested is NOT to make MS products (Uniscribe
    in particular) generate text that is not compliant with Unicode as it is now,
    BUT to make them generously accept 'decomposed cluster/complex Jamos'
    and treat them as their corresponding 'precomposed' forms when they're
    coming from outside. This would not hamper, in any way, the open exchange of
    pre-1933 orthography Korean text that all of us are pursuing. Moreover,
    putting this additional 'composition' into the OT layout table (along
    with some other places along the stream if necessary) of OT fonts would
    not decrease but increase the chance of getting identical rendering
    results across platforms where OT fonts are used.
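
    A minimal sketch of such a lenient 'accept on input' step is shown below.
    The mapping table is a hypothetical tailoring, NOT part of NFC/NFD, and it
    lists only the three clusters mentioned in this message; a real table
    would have to cover every encoded cluster Jamo. The input is assumed to be
    already decomposed into conjoining Jamos.

```python
# Hypothetical tailoring table (an assumption for illustration only):
# sequences of basic leading consonants -> the separately encoded cluster.
CLUSTERS = {
    (0x1100, 0x1100): 0x1101,          # KIYEOK KIYEOK -> SSANGKIYEOK
    (0x1109, 0x1100): 0x112D,          # SIOS KIYEOK -> SIOS-KIYEOK
    (0x1109, 0x1107, 0x1100): 0x1133,  # SIOS PIEUP KIYEOK -> SIOS-PIEUP-KIYEOK
}
MAX_LEN = max(len(key) for key in CLUSTERS)


def compose_leading_clusters(cps):
    """Greedily replace known Jamo runs with their encoded cluster form.

    Longest match wins; anything not in the table passes through unchanged.
    """
    out = []
    i = 0
    while i < len(cps):
        for n in range(MAX_LEN, 1, -1):
            key = tuple(cps[i:i + n])
            if key in CLUSTERS:
                out.append(CLUSTERS[key])
                i += n
                break
        else:
            # No table entry starts here: copy the code point as-is.
            out.append(cps[i])
            i += 1
    return out


# U+1109 U+1100 U+1161 U+11A8 is treated like U+112D U+1161 U+11A8.
print([f"U+{cp:04X}" for cp in
       compose_leading_clusters([0x1109, 0x1100, 0x1161, 0x11A8])])
```

    Because leading consonants can only occur at the start of a syllable, a
    run of them never straddles a syllable boundary, so matching on the raw
    sequence is safe for this table. A sequence with no encoded cluster
    (e.g. U+1105 U+1107 U+1107) is simply left alone.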

    >UTR #29 is a subject that I will not address at this point. I have not
    >studied it with regards to Korean, but would not be surprised if there
    >are some errors present.

      Are you saying that there are some errors in UTR #29? Well,
    I'm not saying that it's perfect (all of us are prone to make
    mistakes). However, it's NOT an error by any means for UTR #29
    to say that sequences such as 'U+1100 U+AC00' are a single grapheme
    instead of two. They have always been considered a single grapheme
    since Unicode 2.0 (the earliest Unicode standard
    for which I (used to) have a hardcopy). Unicode 3.0 might not have
    been as clear as possible about this (I think it was clear enough),
    but any remaining doubt was cleared up by Unicode 3.2 section 3.11 and
    UTR #29.

    > It is important to know that we do not consider
    >that the precomposed Jamo characters (like U+AC01) are valid inputs for
    >composing an Old Hangul jamo. Thus, from my perspective there will
    >*always* be a syllable boundary before and after each precomposed Jamo
    >form.
    >

    Well, whether you consider it valid or not, UTR #29 and Unicode 3.2
    section 3.11 are pretty clear that there's NO
    syllable boundary in sequences like {L LV}, {L LVT}, {LV V}, {LVT T},
    while you and the document at
    http://www.microsoft.com/typography/otfntdev/hangulot/default.htm
    consider them as two 'syllables' (graphemes) with the syllable boundary
    between L and LV/LVT, LV and V, and LVT and T. Considering them
    as two graphemes is a clear violation of the Unicode standard you
    want to abide by.
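
    The boundary analysis can be sketched with a small table of pairwise
    'no break' rules in the style of UTR #29. This is only an illustration:
    the Jamo ranges are rough assumptions about which code points count as
    L, V and T, and combining marks such as tone marks are not handled.

```python
def hangul_class(cp):
    """Classify a code point as L, V, T, LV, LVT or X (anything else)."""
    if 0x1100 <= cp <= 0x115F:   # leading consonants incl. choseong filler (assumed range)
        return "L"
    if 0x1160 <= cp <= 0x11A2:   # vowels incl. jungseong filler (assumed range)
        return "V"
    if 0x11A8 <= cp <= 0x11F9:   # trailing consonants (assumed range)
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:   # precomposed syllables
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return "X"


# For each class, the classes that may follow it WITHOUT a syllable boundary.
NO_BREAK = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def syllables(cps):
    """Split a list of code points into syllable-sized groups."""
    groups = []
    for cp in cps:
        prev = hangul_class(groups[-1][-1]) if groups else None
        if prev is not None and hangul_class(cp) in NO_BREAK.get(prev, ()):
            groups[-1].append(cp)    # no boundary: extend the current syllable
        else:
            groups.append([cp])      # boundary: start a new syllable
    return groups


# {L LV}, {L LVT}, {LV V} and {LVT T} each come out as one group.
for seq in ([0x1100, 0xAC00], [0x1109, 0xAC01],
            [0xAC00, 0x1161], [0xAC01, 0x11A8]):
    print(len(syllables(seq)))
```

    By contrast, a precomposed syllable followed by a new leading consonant,
    such as U+AC01 U+1100, is split into two groups.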

    >I find this discussion very interesting because some of the behaviors
    >you are describing put the output in the format of Old Hangul
    >combinations in conflict with the expected behavior of modern Hangul as
    >some of the composable forms are written on full spaces in an uncombined
    >manner.

    I'm not sure I'm following you here. Could you give an example sequence
    with Unicode code points?

    > This is a significant issue that has a huge impact on our
    >customers. Perhaps the way this could be handled is to specify that the
    >ZERO WIDTH JOINER must be used between characters that should be
    >combined. That way a user could type modern Hangul as they can now with
    >correct results, but still have the option of forming Old Hangul
    >clusters using the same set of characters.

      No, I don't think there's any need for ZWJ for Hangul. This
    is not just theoretical speculation: I have two actual
    implementations (of UTR #29 and Unicode 3.2 section 3.11 syllable boundary
    analysis) and so far I haven't found any problem with them. As I
    explained in my message, and as UTR #29 and Unicode 3.2 section 3.11
    also make clear, there's absolutely no need to use ZWJ for Hangul text.
    Syllable boundaries can be clearly identified without using ZWJ at
    all. I'm pretty sure experts on the UTC would jump out of their seats immediately
    on hearing that ZWJ is necessary for Korean Hangul.

      If users want a break between U+1100 and U+AC01, they have to enter
    U+1160 (Hangul Jungseong Filler) between U+1100 and U+AC01
    to turn U+1100 into a proper syllable (U+1100 U+1160).

    >It would be wonderful if you can help us understand how Old Hangul can
    >work the best within the constraints of Unicode in which we must work.

    I'm more than willing to help you with that. However, we have to
    stand on the same ground as to what the constraints of Unicode are
    before that.

    As I wrote above, I believe there's a bit of 'internal inconsistency' in
    Unicode and I'm hoping that the problem will be resolved before Unicode
    4.0 comes out by introducing tailoring of composition/decomposition for
    Hangul Jamos that will narrow the gap between the Korean script as created
    by its inventors in the 15th century and the Unicode encoding model of
    the Korean script.

    >We also need to understand how to allow the majority of users today who
    >use modern Hangul and get the results they expect for current usage
    >while keeping it possible for scholars and others to continue to
    >represent Old Hangul with the understanding that it would be in a manner
    >different than the Hangul script was originally conceived.
    >

    Perhaps I failed to make it clear, but what I suggested to you does not
    require the majority of users to change anything they've been doing.
    They can keep working exactly the same way as they do now. What I suggested
    is not to replace something in the current practice with something else
    but to add to what's being done. On the other hand, having 'standard'
    libraries that provide the (canonical) decomposition of cluster/complex
    Jamos into basic/simple Jamo sequences and embedding a similar
    'apparatus' into OT fonts would be a great boon for Korean
    linguists who sometimes need to work at the 'genuinely atomic' level.
    Jungshik


