Korean line breaking and UTR14 (was Re: extracting words)

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon Feb 12 2001 - 16:58:33 EST

On Sun, 11 Feb 2001, Mark Davis wrote:

MD> Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
MD> recommended in my last message. The Unicode standard is online, as is the
MD> TR. Both can be found by going to www.unicode.org, and selecting the right
MD> topic. The TR in particular discusses the recommended approach to line break
MD> in great detail.

As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
behind the following passages in the TR on line breaking (and what's
written about it in Chap 5 of TUS 3.0) came from.

UTR14> Korean may alternately use a space-based (style 1) instead of the
UTR14> style 2 context analysis.

UTR14> 1. Korean uses either implicit breaking around
UTR14> Hangul and ideographs or uses spaces. Reference [1] shows
UTR14> how this can be elegantly handled by the second or third
UTR14> method. Only the intersection of ID/ID, AL/ID and ID/AL
UTR14> are affected. For alphabetic style line breaking, breaks
UTR14> for these four cases require space, for ideographic style
UTR14> line breaking, these four cases don't require spaces.

where style 1 and style 2 are defined as

UTR14> 1. Western (spaces and hyphens are used to determine breaks)
UTR14> 2. East Asian (lines can break anywhere, unless prohibited)
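
To make the two styles quoted above concrete, here is a minimal Python
sketch of where break opportunities fall under each one. It is not the
UAX #14 algorithm: the names is_ideographic and break_opportunities are
hypothetical, the ID class is approximated by two code point ranges
(Hangul syllables and the basic CJK ideograph block), AL is approximated
by str.isalpha(), and all other line break classes are ignored.

```python
from typing import List


def is_ideographic(ch: str) -> bool:
    """Rough stand-in for class ID: Hangul syllables and basic CJK ideographs."""
    cp = ord(ch)
    return 0xAC00 <= cp <= 0xD7A3 or 0x4E00 <= cp <= 0x9FFF


def break_opportunities(text: str, style: int) -> List[int]:
    """Return every index i such that a line may break just before text[i].

    style 1: Western, breaks only after spaces.
    style 2: East Asian, additionally breaks inside ID/ID, AL/ID and
             ID/AL pairs that have no space between them.
    """
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " ":
            continue  # never break before a space
        if prev == " ":
            breaks.append(i)  # both styles allow a break after a space
        elif (
            style == 2
            and prev.isalpha()
            and cur.isalpha()
            and (is_ideographic(prev) or is_ideographic(cur))
        ):
            breaks.append(i)  # implicit break around Hangul/ideographs
    return breaks


sample = "한국어 text"
style1 = break_opportunities(sample, 1)
style2 = break_opportunities(sample, 2)
```

With the sample string, style 1 only offers the break after the space,
while style 2 also allows a break between each pair of Hangul syllables.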

Let me make it clear that virtually NO books published in Korean use the
space-based (style 1) line breaking rule. The style 2 line breaking rule
is *exclusively* used for modern Korean text, no matter what some broken
word processors for Korean offer as an alternative to style 2 and what
some web browsers do (e.g. Netscape 4.x; Mozilla fixed this problem).

I'm very alarmed to find that this 'misinformation' crept into TUS and
UTR14 (now UAX #14). It would be nice if somebody in charge could get
this straightened out.


Jungshik Shin
