Re: Korean linebreaking and UTR14 (was Re: extracting words)

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon Feb 12 2001 - 23:43:01 EST


On Mon, 12 Feb 2001, Mark Davis wrote:

Thank you for your answer.

> Asmus Freytag is the one to talk to; he can look into this.

Do you think I should contact him directly off-line? I thought he was on
this list, both now and back in March 2000 when I wrote about TUS 3.0
p. 124.

> On Mon, 12 Feb 2001, "Jungshik Shin" <jshin@mailaps.org> wrote:
> > On Sun, 11 Feb 2001, Mark Davis wrote:
> >
> > MD> Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
> > MD> recommended in my last message. The Unicode standard is online, as is

> > As I wrote when TUS 3.0 came out, I cannot help wondering where the idea
> > that leads to the following in the TR on line breaking (and what's written
> > about it in Chap. 5 of TUS 3.0) came from.
> >
> > UTR14> Korean may alternately use a space-based (style 1) instead of the
> > UTR14> style 2 context analysis.

BTW, this clearly shows that what Rick McGowan wrote about 'either ... or'
in response to what I wrote about the Korean line breaking rule (TUS 3.0
p. 124) in March 2000 is not right, as I argued then. I'm sure he's
right about 'either ... or' in English grammar but the intention of the
author is on my side if the author of UTR 14 is the same as that of the
part in question in TUS 3.0. I'm enclosing at the end of this message
a part of my message in response to him.

> > I'm very alarmed to find that this 'misinformation' has crept into TUS and
> > UTR14 (now UAX #14). It would be nice if somebody in charge could get
> > this straightened out.

This wasn't fixed in Unicode 3.1, either. What would be the best way
to get it addressed before the next revision comes out? I'm afraid just
raising it on this list wouldn't be sufficient (of course, I should
have followed up more vigorously last year).

Regards,

Jungshik Shin

Enc.

1. Two messages of mine
   the first one : March 1, 2000
   the second one: March 2, 2000

From: Jungshik Shin <jungshik.shin@yale.edu>
Subject: Korean line breaking rules : Unicode 3.0 (p. 124)
Date: Wed, 1 Mar 2000 19:23:23 -0800 (PST)

On Sun, 13 Feb 2000, Kenneth Whistler wrote:

> Lest anyone feel unduly constrained, let me note that now that
> the editorial committee has closed the book, so to speak, on Unicode 3.0,
> all of you who are about to open the book for the first time should
> feel free to unleash your commentary on the text.

   I've just received my copy of the Unicode 3.0 book, so here goes
my first commentary.

   On page 124 (Section 5.15, Locating Text Element Boundaries),
the third paragraph has the following near the end:

U3.0> In particular, word, line, and sentence boundaries will need to
U3.0> be customized according to locale and user preference. In Korean,
U3.0> for example, lines may be broken either at spaces (as in Latin text) or
U3.0> on ideographic boundaries (as in Chinese).

  First of all, it's a great mystery to me how on earth this
strange notion of Korean having *two* different line breaking rules (as
opposed to one) crept into the expertise of non-Korean experts on Korean
and finally made it into the Unicode 3.0 book and the Unicode TR on line
breaking.

  None of the dozens of Korean books on my bookshelves that
I've just gone through breaks lines *exclusively* at spaces. All of them
break lines freely at *syllables*. The only places I can think of where
lines are broken *exclusively* at spaces (for Korean text) are completely
*broken* (as far as Korean line breaking is concerned) web browsers like
Netscape and MS IE and possibly earlier implementations of Korean LaTeX.
One may add to the list Korean text formatted by the non-localized version
of 'fmt' (in Unix) as another example. To work around the problem caused
by these broken web browsers, some Korean web authors run their html
files through a simple filter that inserts <wbr> between every pair of
Korean syllables. To see what I mean, you may want to take a look at
<http://photon.hgs.yale.edu/~jungshik/lb.html> and
<http://photon.hgs.yale.edu/~jungshik/lbscreenshot.jpg>
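
  The kind of filter described above might look roughly like the
following Python sketch. It is only an illustration, not the filter any
particular web author used. It assumes that "Korean syllable" means a
precomposed Hangul syllable (U+AC00 to U+D7A3), and the name insert_wbr
is made up for this example. It also treats its input as plain text: a
real filter would have to skip markup such as tag names and attribute
values.

```python
import re

# Assumption: a "Korean syllable" is a precomposed Hangul syllable,
# i.e. a code point in the range U+AC00..U+D7A3.
# The pattern matches the empty position between two such syllables.
_BETWEEN_SYLLABLES = re.compile(r"(?<=[\uAC00-\uD7A3])(?=[\uAC00-\uD7A3])")


def insert_wbr(text: str) -> str:
    """Insert <wbr> between every pair of adjacent Hangul syllables.

    Hypothetical helper: it does not parse HTML, so it should only be
    applied to text content, not to tags or attribute values.
    """
    return _BETWEEN_SYLLABLES.sub("<wbr>", text)


if __name__ == "__main__":
    # U+D55C U+AD6D U+C5B4 is the three-syllable word for "Korean language".
    print(insert_wbr("\ud55c\uad6d\uc5b4 text"))
```

  Latin text and spaces are left untouched, so a browser that breaks
only at spaces and at <wbr> ends up breaking at every syllable boundary.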

  Let me emphasize that a line can be broken at any syllable boundary
in Korean text (except for some obvious exceptions that also apply to
English text: e.g. punctuation marks like '!' and '?' cannot begin a line).
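
  To make the rule concrete, here is a small Python sketch that lists
the positions where a line may be broken. It is a simplification under
stated assumptions, not an implementation of the line breaking TR:
syllables are taken to be precomposed Hangul syllables (U+AC00 to
U+D7A3), the set of characters that cannot begin a line contains only
the two marks mentioned above, and the function name is invented for
this example.

```python
# Assumption: only the two punctuation marks named in the text are listed;
# a real implementation would need a much longer set.
CANNOT_BEGIN_LINE = {"!", "?"}


def is_hangul_syllable(ch: str) -> bool:
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def break_opportunities(text: str) -> list[int]:
    """Return every index i such that a new line may start at text[i].

    A break is allowed after a space or between two Hangul syllables,
    but never directly before a character that cannot begin a line.
    """
    positions = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur == " " or cur in CANNOT_BEGIN_LINE:
            continue  # a line never starts with a space or with '!' / '?'
        if prev == " " or (is_hangul_syllable(prev) and is_hangul_syllable(cur)):
            positions.append(i)
    return positions


if __name__ == "__main__":
    # Five syllables, one space, and a trailing '!'.
    sample = "\ud55c\uad6d\uc5b4 \uc88b\uc544!"
    print(break_opportunities(sample))  # [1, 2, 4, 5]
```

  Note that the space-based break (index 4) is just one of the allowed
positions; it is not an alternative to the syllable-based ones.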

  Secondly, even in Latin scripts (well, at least in English) lines can
be broken not only at spaces but also at syllables (syllabic boundaries)
with a hyphen. The only difference between Korean line breaking and English
line breaking is that Korean doesn't need a hyphen when lines are broken at
syllables, because in Korean syllables form another visual unit a level
higher than alphabetic/phonetic letters (consonants and vowels).

  Thirdly, the expression 'ideographic boundaries' is not appropriate;
it should be 'syllabic boundaries' or 'syllables'.

  Given these, I'd like to suggest that the last sentence (the one that
begins with 'In Korean, for example...') be removed in a future edition
because Korean is NOT a good example of a case where there can be multiple
line breaking rules depending on user preference.

    Jungshik Shin

From: Jungshik Shin <jungshik.shin@yale.edu>
Subject: RE: Korean line breaking rules : Unicode 3.0 (p. 124)
Date: Thu, 2 Mar 2000 12:20:31 -0800 (PST)

On Thu, 2 Mar 2000, Rick McGowan wrote:

> I think that unfortunately both Hoon Kim and Jungshik Shin have
> *entirely* misinterpreted the text. The text says:

> U3.0> for example, lines may be broken either at spaces (as in Latin
> U3.0> text) or on ideographic boundaries (as in Chinese).

> The word "or" on the second line would never be interpreted as an "exclusive
> or", it is an "inclusive or". In "C Language" syntax, it means "A|B"; it
> does not mean "A^B".

U3.0> In particular, word, line, and sentence boundaries will need to
U3.0> be customized according to locale and user preference. In Korean,

 If it was written with that intention, what would you say about
the preceding two lines? What's 'user preference' here? It implies
'exclusive or', doesn't it? In other words, it implies users may choose
to turn off 'B', doesn't it? (No Korean typesetter in her/his right mind
would do that.) If not, what's the point of taking an example of Korean
line breaking after that sentence about 'user preference'?

 On top of that, if that's your intention, it'd be clearer
to say 'lines can be broken on both spaces and syllable boundaries' (or
on any syllable boundaries including spaces), wouldn't it?

> In that light, some of their previous comments should probably be
> re-examined.

 Nonetheless, the last sentence of the paragraph in
question about Korean line breaking had better be removed (it's not
necessary at all in my opinion) to avoid possible/unnecessary confusion
it leads to (as is evident in Netscape's implementation of Korean line
breaking).

    Jungshik Shin


