RE: johab compound letters reference for Hangul? (3)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 21 2003 - 05:51:36 EST


    Doug Ewell wrote:
    > Philippe,
    >
    > > When looking at this document:
    > > http://std.dkuug.dk/JTC1/SC22/WG20/docs/n1051-hangulsort.pdf
    > > and its associated data file "n1051t-table-hangulctt6.txt"...
    >
    > Do you have access to a Web or FTP site, or some other place where you
    > could post these relatively long lists? (If you don't, I understand; I
    > didn't for many years while I was with CompuServe.)
    >
    > By including a long list of proposed decompositions in your message,
    > followed by commentary at the end, you run the risk that people will
    > skip over the list and miss out on the commentary.

    Well, I'm involved with projects that include Korean writers and readers,
    with collections of texts that are very imperfectly mapped in Unicode and
    badly rendered with most fonts, even though they are correctly encoded.

    My idea was to offer better accessibility to these texts. I do think
    that Unicode made a mistake in encoding Hangul twice, as did the Korean
    standards that used the Wansung set and later the Johab set.

    Many implementers feel that the precomposed Hangul syllables are both
    necessary and sufficient to support the Hangul script. Because of the
    size of that set in the Unicode code space, they are reluctant to include
    this support. However, Korea is the most Internet-connected country in the
    world, and the need to have Unicode texts correctly supported for tens of
    millions of users is urgent.

    We could serve them more easily by describing more precisely in Unicode the
    structure of their script, which they all learned at school. The
    artificially built subset of the script found in Unicode and KSC5601
    does a disservice to this need for good support, including good
    typography. If you look at various Korean web sites, you'll see that the
    lack of support for good typography has a consequence: most sites use a
    lot of bitmaps to represent text, even though that breaks accessibility
    for blind users, for whom the Basic jamos could simply be represented
    with very simple Braille patterns.

    Better support is also needed in text editors, so that users can edit
    letters separately within Hangul clusters. I do think that a good Korean
    editor should have a display mode where Hangul clusters are represented
    only by Basic Jamos presented as an alphabet (for example with
    taller/uppercase glyphs for Choseongs, and with Jungseongs and Jongseongs
    presented like lowercase letters in Latin).

    Then it would be up to the text editor to first automatically recompose
    the Johab compound jamos to get the preferred Unicode NFD form, and
    optionally apply the algorithmic composition of Hangul syllables.
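
    As an illustration of the algorithmic composition step mentioned above,
    here is a minimal sketch of the arithmetic from the Unicode Standard's
    Hangul syllable composition. The function name compose_syllable is only
    an illustrative choice, and the sketch covers only the modern jamos that
    have precomposed syllables.

```python
# Constants of the Hangul syllable composition arithmetic in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_syllable(lead: str, vowel: str, trail: str = "") -> str:
    """Compose one precomposed syllable from conjoining jamos.

    `lead` is a choseong, `vowel` a jungseong, `trail` an optional jongseong.
    (Illustrative helper name; not part of any library.)
    """
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("lead or vowel is outside the modern jamo range")
    if trail and not (1 <= t_index < T_COUNT):
        raise ValueError("trail is outside the modern jongseong range")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# U+1112 (hieuh) + U+1161 (a) + U+11AB (nieun) gives U+D55C.
syllable = compose_syllable("\u1112", "\u1161", "\u11AB")
print(f"U+{ord(syllable):04X}")
```

    Python's own unicodedata.normalize("NFC", ...) performs the same
    composition; the explicit arithmetic is shown only to make the step
    visible.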

    Browsers should also be able to decompose syllables and compound Johab
    jamos into Basic jamos, in order to use simpler fonts that are only defined
    with these Basic Jamos. Then, if browsers implement the 2D composition
    model for Hangul (which is defined only in terms of Horizontal or Vertical
    property attached to Basic letters) they could recreate the layout of
    Hangul syllables. Basic fonts could also be easily extended to fonts
    supporting the whole standard set of clusters (I prefer the term cluster
    to the term syllable for the Unicode, Johab or Wansung subsets of
    valid Hangul syllables), by adding a default composition routine. Fonts
    could also be more easily hinted within the reduced set, with the
    advantage that this hinting would be inherited in clusters.
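
    The decomposition described above could be sketched roughly as follows.
    The standard NFD step splits precomposed syllables into conjoining
    jamos; the second step uses a table that is purely hypothetical here
    (COMPOUND_TO_BASIC holds just three sample entries chosen for
    illustration, not an excerpt from any published data file).

```python
import unicodedata

# Hypothetical sample entries mapping compound jamos to Basic jamos.
# A real table would cover every compound jamo; these three are assumptions
# chosen only to illustrate the idea.
COMPOUND_TO_BASIC = {
    "\u1101": "\u1100\u1100",  # choseong ssangkiyeok -> kiyeok + kiyeok
    "\u116A": "\u1169\u1161",  # jungseong wa -> o + a
    "\u11B0": "\u11AF\u11A8",  # jongseong rieul-kiyeok -> rieul + kiyeok
}


def to_basic_jamos(text: str) -> str:
    """Decompose syllables to jamos (NFD), then compound jamos to Basic jamos.

    Note that NFD also decomposes non-Hangul characters such as accented
    Latin letters; this sketch does not try to restrict it to Hangul.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(COMPOUND_TO_BASIC.get(ch, ch) for ch in decomposed)


# U+B2ED decomposes to U+1103 U+1161 U+11B0, and the compound final
# U+11B0 is then split into U+11AF U+11A8.
basic = to_basic_jamos("\uB2ED")
print(" ".join(f"U+{ord(ch):04X}" for ch in basic))
```

    A font that only defines glyphs for the Basic jamos could then be used
    on the result, with the 2D layout left to the renderer.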

    Finally the capability of performing full text search in Hangul is too
    limited for now and not easily interoperable. I have been told that
    Google.kr already performs this decomposition of Hangul texts for
    pages encoded with Korean standard charsets, in order to increase the
    number of good hits detected in pages. If Google needs that, I think that
    many users will need it too, and I would prefer that this
    decomposition of compound jamos be officially described with a coherent
    set of decompositions (which, in my opinion, should become canonical,
    except that the Unicode NFD and NFC forms would not be modified for
    existing correctly composed texts: the NFD form would use the Johab
    compound letters, or recompose to them).
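
    To show why decomposition can increase the number of hits, here is a
    small sketch of jamo-level substring matching. It relies only on the
    standard NFD decomposition of syllables; the function name jamo_contains
    is an illustrative assumption, and nothing here is claimed to reflect
    how any particular search engine works.

```python
import unicodedata


def jamo_contains(text: str, query: str) -> bool:
    """Return True if the jamo sequence of `query` occurs within `text`.

    Both strings are decomposed with NFD, so a query can match the
    beginning of a precomposed syllable that has an extra final consonant.
    """
    return unicodedata.normalize("NFD", query) in unicodedata.normalize("NFD", text)


text = "\uD55C\uAD6D"   # two precomposed syllables, U+D55C U+AD6D
query = "\uD558"        # U+D558: same lead and vowel as U+D55C, no final

print(query in text)               # code point comparison finds nothing
print(jamo_contains(text, query))  # jamo-level comparison finds a match
```

    A match found this way may begin or end in the middle of a syllable, so
    an application would have to decide whether such partial hits are
    wanted. Extending the comparison to Basic jamos would additionally
    require a table of compound jamo decompositions.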

    Two new files would need to be added to the UCD: one that defines these
    extra "canonical Johab" decompositions into Basic Jamos (I would call
    it "HangulBasic.txt") and an extension to the CharacterProperties
    to assign the horizontal or vertical stack composition of a Basic
    or compound jamo (intended for renderers that choose to display
    the 2D layout of syllables):

    - if two clusters have horizontal layout (normally these jamos are taller
    than wide and most often include a long vertical stroke), they are
    stacked side-by-side from left-to-right, and their resulting composition
    also has horizontal layout.
    - same thing if the first cluster is vertical and the second horizontal.
    - if two clusters have vertical layout (normally these jamos are wider
    than tall and most often include a long horizontal stroke), they are
    stacked from top-to-bottom, and their resulting composition
    also has vertical layout.
    - same thing if the first cluster is horizontal and the second vertical.
    In summary:
            H + H -> H,
            V + H -> H,
            V + V -> V,
            H + V -> V
    i.e. the layout of a compound is determined by the last jamo in the
    cluster, something that is extremely simple to understand and implement
    efficiently.
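
    The rule summarized above can be sketched in a few lines. The property
    values in BASIC_LAYOUT are hypothetical: they are assigned here only
    from the description above (long vertical stroke means horizontal
    layout, long horizontal stroke means vertical layout), and all the
    names are illustrative assumptions.

```python
H, V = "H", "V"

# Hypothetical layout property for a few Basic vowel jamos (assumed values).
BASIC_LAYOUT = {
    "\u1161": H,  # jungseong a: long vertical stroke
    "\u1175": H,  # jungseong i: long vertical stroke
    "\u1169": V,  # jungseong o: long horizontal stroke
    "\u116E": V,  # jungseong u: long horizontal stroke
}


def combine(first: str, second: str) -> str:
    """Layout of a two-part compound: always the layout of the second part."""
    return second


def stacking(first: str, second: str) -> str:
    """Direction in which the two parts are placed next to each other."""
    return "left-to-right" if second == H else "top-to-bottom"


def compound_layout(jamos: str) -> str:
    """Fold the rule over a sequence of Basic jamos found in BASIC_LAYOUT."""
    layout = BASIC_LAYOUT[jamos[0]]
    for jamo in jamos[1:]:
        layout = combine(layout, BASIC_LAYOUT[jamo])
    return layout


# o (V) followed by a (H): placed side by side, and the compound is H.
print(stacking(V, H), compound_layout("\u1169\u1161"))
```

    Because the result depends only on the last jamo, a renderer never has
    to look further back than one element when laying out a cluster.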

    Of course this second set of layout properties is not required for
    renderers, as the presentation as a 2D syllabic cluster is optional.
    Hangul already supports a "Half-width" presentation, made only of Basic
    letters, where the differentiation between Choseongs and Jongseongs can
    be a matter of style: bold/light, tall/x-height (similar to lettercase
    in Latin), ...
    They may even be left undifferentiated in texts (similar to what was done
    with Wansung, where the reader implicitly rebuilds the boundaries between
    syllables using his linguistic and phonetic knowledge of Korean, exactly
    as readers of Latin text do). Fonts could be built to support either
    presentation style.

    This archive was generated by hypermail 2.1.5 : Sun Dec 21 2003 - 06:35:59 EST