L2/02-392

Re: TR29 Feedback
From: Mark Davis
Date: 2002-11-01

The following is collected (unedited) feedback on TR29, sometimes with a few comments from me interspersed.


----- Original Message -----
From: "Murray Sargent" <murrays@Exchange.Microsoft.com>
To: "Mark Davis" <mark.davis@jtcsv.com>
Sent: Friday, October 11, 2002 18:18
Subject: RE: TR29 Text Boundaries


> I think Mynmar should be Myanmar. Thanks for putting in the comment
> about " in Hebrew.
>
> Murray


----- Original Message -----
From: "Jungshik Shin" <jshin@mailaps.org>
To: "Mark Davis" <mark.davis@jtcsv.com>
Sent: Friday, November 01, 2002 09:58
Subject: Re: DUTR #29: Text Boundaries

> On Fri, 25 Oct 2002, Mark Davis wrote:
>
> Hello,
>
> I'm sorry I missed the deadline by one day. I planned to write to you
> yesterday, but didn't manage to (I didn't know the deadline was yesterday).
> It'd be nice if you could include my feedback in your presentation(?)
> at the UTC meeting next week.
>
> > There is a UTC meeting on November 5. If there is any public feedback on
> > text boundary issues from
> >
> > http://www.unicode.org/unicode/reports/tr29/
> > (or the related http://www.unicode.org/unicode/reports/tr14/)
> >
> > that feedback should be in by next Thursday at the latest. This will be one of the last opportunities to make any changes before Unicode 4.0.
>
>  In section 3 on grapheme cluster boundaries, when you write about
> editing a grapheme cluster element by element, wouldn't it be nice to give
> an example or two (editing Indic, Thai, Lao scripts and Hangul)?
>
> As for the behavior of the 'delete' and 'backspace' keys, in some cases
> they behave differently depending on context.  For instance, during
> preediting (before committing), some input methods (for Korean; I believe
> Indian users may want something similar, as Marco and I discussed last
> year) let them act on individual elements, while they act on grapheme
> clusters once committed.
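The preedit/commit distinction described above can be sketched in a few lines of Python. This is a simplification for illustration only (real input methods track preedit state rather than renormalizing committed text, and the function names are hypothetical): during preediting, backspace removes one jamo element at a time; after commit, it removes the whole grapheme cluster.

```python
import unicodedata

def backspace_preedit(s: str) -> str:
    # During preediting, backspace acts on individual elements:
    # decompose to jamo (NFD), drop the last element, recompose (NFC).
    elements = list(unicodedata.normalize("NFD", s))
    return unicodedata.normalize("NFC", "".join(elements[:-1]))

def backspace_committed(s: str) -> str:
    # Once committed, backspace acts on the whole grapheme cluster.
    # (Simplified: assumes the text ends in one precomposed syllable.)
    return s[:-1]

# U+D55C HANGUL SYLLABLE HAN is a single cluster of three jamo elements.
print(backspace_preedit("한"))    # '하' -- only the final consonant removed
print(backspace_committed("한"))  # ''   -- the whole syllable removed
```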
>
>   In section 4 (word boundaries), the paragraph following example 2
> reads:
>
> TR>   .... However, it's relatively seldom that a line-break boundary is
> TR> not a word-break boundary. One example is a word containing SHY. It will
> TR> break across lines, yet is a single word.
>
>   You might consider adding a second example taken from Korean. A line
> can break between two GCs (Hangul syllables), but the position between
> two GCs (syllables) is NOT a word-break boundary.
>
>  In the table for word boundary rules (table 2), 'FB' appears to be
> introduced without being explained first. Although it became obvious after a
> while, at first I wondered what it was. Perhaps putting it inside a pair
> of parentheses as follows would help: 'Treat a grapheme cluster (GC)
> as if it were ....  the first character of the cluster (FB)'. The same
> might be the case for sot/eot, although I didn't have a problem recognizing
> them.
>
>  'Format' also appears in the table, but it's not defined in
> the boundary property values table (table 2 in section 4); it's defined
> in table 3 (in section 5). I guess that has to be moved up to the table in
> section 4.
>
>
>   The second item in Notes following word boundary rules has the
> following :
>
>    - Where Hangul text is written without spaces, the same applied.
>
> I'm afraid this gives the rather unnecessary impression that Hangul text
> is written without spaces fairly often. It's very, very rare, if done at
> all, that *modern* Korean text is written without spaces, with or without
> CJK ideographs, because the orthographic standards of both Koreas are very
> clear that words have to be delimited by spaces, except for 'particles',
> which have to be put after the preceding words without a space. ('Particles'
> attached to preceding words present a little challenge to search engines
> if they want to implement WWS for Korean.)
>
> Actually, what's more needed for Korean text in those Notes than what you
> have now is that CJK ideographs in Korean text have to be treated exactly the
> same way as Hangul precomposed syllables. That is, word boundary rule
> 14 has to be tailored for Korean to remove 'CJK ideographs' from 'any'
> on both sides, and rule 5 has to be modified to read '(ALetter|Ideograph)
> x (ALetter|Ideograph)'.  As you realize, this simple modification of rule
> 5 does not work very well in some cases, because 'I' and 'H' in 'I H'
> or 'H I' (H is for Hangul and I is for ideograph) can be regarded as
> belonging to separate words by some people. However, I guess that detail
> can be left out.
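The tailored rule 5 proposed above might be sketched roughly as follows in Python. The classifier is a deliberately crude stand-in for the real TR29 property assignments, covering only a couple of code-point ranges, and the function names are illustrative only:

```python
def wb_class(ch: str) -> str:
    # Very rough classifier for this sketch (not the full TR29 property set).
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:
        return "ALetter"      # precomposed Hangul syllables
    if 0x4E00 <= cp <= 0x9FFF:
        return "Ideograph"    # CJK Unified Ideographs
    if ch.isalpha():
        return "ALetter"
    return "Other"

def is_word_break(left: str, right: str, tailored_for_korean: bool = True) -> bool:
    l, r = wb_class(left), wb_class(right)
    word_like = {"ALetter", "Ideograph"} if tailored_for_korean else {"ALetter"}
    # Tailored rule 5: (ALetter|Ideograph) x (ALetter|Ideograph) -- no break.
    if l in word_like and r in word_like:
        return False
    return True

print(is_word_break("학", "校"))                              # False: kept together
print(is_word_break("학", "校", tailored_for_korean=False))   # True: default breaks here
```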
>
> Therefore, if you want to give tailoring info for Korean, I suggest
> that what I wrote in the previous paragraph (summarized in a sentence or
> two) go in instead of what's in there now.
>
> Another note you might consider adding to the Notes for section 4 is a
> tricky aspect of CR/LF. Some (plain) text editors (e.g. vi, emacs)
> automatically add (or can be configured to add) CR/LF when the column
> width reaches the upper limit (user-configurable). This can happen
> *inside* words (between two GCs, but not inside GCs) as well as between
> two words. Some people also do this manually. Therefore, CR/LF may
> or may not be a word-break boundary, and which it is cannot be
> determined without context analysis. This has to do with the fact
> that line-break boundaries are not necessarily word-break boundaries.
> Text editors that automatically add CR/LF at a line-break boundary when
> the preset maximum column width is reached cannot be blamed, because
> they are doing what they're supposed to do: implementing line break rules.
> Another example is 'fmt' (or similar text utilities) usually found on
> Unix. Current implementations of such programs are NOT compliant with
> Unicode line break rules and only break lines at space-like characters.
> When users filter what they write through 'fmt'-like programs compliant
> with Unicode line-breaking rules, the result is quite challenging for
> implementors of word-boundary rules.
>
>
> BTW, I've just read UTR #14 (line breaking rules) and found that it
> allows a line break between leading Hangul consonants (that is, inside
> a single GC as defined in your draft TR). (See the explanation of CM
> in table 1.) Later, in the detailed explanation of CM, the same problem is
> found (... while the initial combining Jamos have the same property
> as Hangul syllables.) This is clearly in error. I'll try to write about
> this problem to the author, but you may also wish to bring up this issue
> if an opportunity arises at the UTC meeting next week.
>
>
>   Sorry again that I haven't managed to give this feedback earlier, but
> I would appreciate it if you could take this into account when
> preparing the next release of the 'draft' or the final approved
> TR.
>
>    Best,
>
>    Jungshik

> I have several comments on
> http://www.unicode.org/unicode/reports/tr29/
> Draft Unicode Technical Report #29 Text Boundaries
>
> 1.  I strongly disagree with this treatment of colon (:)
> as a MidNumLet, as stated:
> <quote>
> MidNumLet Any of the following:
> U+002E (.) FULL STOP (period)
> U+003A (:) COLON (used in Swedish)
> </quote>
> ...
> <quote>
> Certain cases like colon in words (c:a) are included even though they may
> be
> specific to relatively small user communities (Swedish) because they don’t
> occur otherwise in normal text, and so don’t cause a problem for other
> languages.
> </quote>
>
> I am not sure what, if anything, is considered "normal text".
> Much of what I am reading and writing about these days either is XML,
> or discusses XML.  Thus I see a great many strings of the form
> foo:name where foo is a namespace prefix and name is an element name
> (or function name, etc.)  The colon is also used in non-XML
> files, e.g. to separate key:value pairs, e.g. color:red.
> I would certainly want words broken at the colon.
>
> As the spec is currently written, the colon is always a MidNumLet.
> I suggest changing the wording to state that this is
> a tailoring, only for Swedish.
> (This would be analogous to the wording in the very
> next sentence of the spec:
> <quote>
> For Hebrew, a tailoring may include a double quotation mark between
> letters,
> since legacy data may contain that in place of U+05F4 (״) gershayim. This
> can be done by adding double quotation mark to MidLetter.
> </quote>)
>
> Also, I suspect that even in Sweden many people, e.g. those
> reading [about] XML, will also prefer words to be split at the colon.
> So they might want a way to turn that tailoring off.
>
> *** I will raise this point to the UTC.
>
>
> 2. We have had some discussions about whether these rules in TR 29
> are normative in any way, and I think we concluded that they are not.
> However you stated that it might be useful for an implementation
> to be able to claim some kind of conformance anyway.
> I agree, but would like some clarification.
> I think many implementations will be close,
> but not exact in their conformance.  I think an implementation
> should be able to make one of these statements:
> 1. complies exactly with the example approach in TR 29
> 2. makes all these word splits, and possibly more
> 3. makes no other word splits, but may not make all of these
> 4. anything else
>
> *** The rules are not normative in the sense that conformance to the
> Unicode
> Standard does not require them. I agree with the options that you cite,
> and
> will bring those to the UTC.
>
>
> 3.  The recent widespread adoption of the naming convention
> usually called CamelCase has added the need for another
> kind of word split pattern.  For example in the string
> getWordPiecesFromName I want to allow the word splits
> marked by the vertical bars: get|Word|Pieces|From|Name
>
> An algorithm for processing these word split rules
> is presented in Appendix C of
> The Java™ Architecture for XML Binding (JAXB)
> Public Draft 2, V0.75  October 4, 2002
> which I downloaded from http://java.sun.com/webapps/download/DisplayLinks
> after first going to http://java.sun.com/xml/downloads/jaxb.html
> I only quote a small excerpt, but it should be sufficient,
> and I do include all the patterns.
> <quote>
> An XML name is split into a word list by removing any leading and trailing
> punctuation characters and then searching for word breaks. A word break is
> defined by three regular expressions: a prefix, a separator, and a suffix.
> The prefix matches part of the word that precedes the break, the separator
> is not part of any word, and the suffix matches part of the word that
> follows the break. The word breaks are defined as:
> Table 3-1 XML Word Breaks
> Prefix   Separator Suffix      Example
> [^punct] punct+    [^punct]    foo|--|bar
> digit              [^digit]    foo22|bar
> [^digit]           digit       foo|22
> lower              [^lower]    foo|Bar
> upper              upper lower FOO|Bar
> letter             [^letter]   Foo|\u2160
> [^letter]          letter      \u2160|Foo
> </quote>
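For reference, the quoted break table can be sketched as a single regular expression in Python. This is an ASCII-only simplification (the letter/non-letter rows, e.g. for U+2160, are omitted), and the names are illustrative only:

```python
import re

# Each alternative marks one split point from the quoted table.
_BREAKS = re.compile(
    r"[^A-Za-z0-9]+"              # separator: a run of punctuation, dropped
    r"|(?<=[0-9])(?=[^0-9])"      # digit | non-digit      (foo22|bar)
    r"|(?<=[^0-9])(?=[0-9])"      # non-digit | digit      (foo|22)
    r"|(?<=[a-z])(?=[^a-z])"      # lower | non-lower      (foo|Bar)
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # upper | upper lower    (FOO|Bar)
)

def split_xml_name(name: str) -> list[str]:
    # Zero-width splits require Python 3.7+; empty pieces (from leading
    # or trailing punctuation) are filtered out.
    return [w for w in _BREAKS.split(name) if w]

print(split_xml_name("getWordPiecesFromName"))  # ['get', 'Word', 'Pieces', 'From', 'Name']
print(split_xml_name("FOOBar"))                 # ['FOO', 'Bar']
print(split_xml_name("foo--bar"))               # ['foo', 'bar']
```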
>
> I think it is reasonable for an implementation to
> introduce word split points where the case changes.
> Note especially that FOOBar is split FOO|Bar.
> I generally like this approach.
>
> However I suspect that this JAXB approach, which is based upon
> regular expressions, requires more power than
> the approach used in TR29, which seems to only
> look at two adjacent characters.  Actually I think one
> only needs an additional one-character lookahead to
> handle the above patterns.
>
> *** The TR is not limited to a single character context before and after;
> the context can be more complex expressions. However, I do not believe
> that
> this is the correct approach to take in the default case. This would cause
> problems for real, natural language cases like McDowell, or vederLa
> (Italian), or other cases of interior case changes in real words.
>
> What could be done is to point out that this is a possible tailoring for
> programming language contexts.
> ***
>
>
> 4.  I also provided detailed feedback to you on an earlier
> version of TR 29.  It might be nice to add my name to the list
> of people in the Acknowledgments section.
>
> *** This was a distinct oversight on my part. My apologies -- your name
> will
> definitely be on the next version.
>
>
> Hopefully helpfully yours,
> Steve


> -----Original Message-----
> From: Tolkin, Steve [SMTP:Steve.Tolkin@FMR.COM]
> Sent: Monday, October 28, 2002 1:04 PM
> To: 'Mark Davis'; Michael Rys; Paul Cotton
> Cc: w3c-query-operators@w3.org; w3c-i18n-ig; Andy Heninger; Cathy Wissink; Murray Sargent; Michel Suignard
> Subject: RE: FTTF Agenda Item 4, Review of Unicode boundaries
>
I had a private exchange with Steve and he suggested I post this to the list...

> 1.  I strongly disagree with this treatment of colon (:)
> as a MidNumLet, as stated:
>
<deleted/>

> Much of what I am reading and writing about these days either is XML,
> or discusses XML.  Thus I see a great many strings of the form
> foo:name where foo is a namespace prefix and name is an element name
> (or function name, etc.)  The colon is also used in non-XML
> files, e.g. to separate key:value pairs, e.g. color:red. 
> I would certainly want words broken at the colon.
>
That is exactly why I'd like to see ":" not be a word boundary (in all cases). In other words, when I tokenize some text that talks about XML, I'd like to get "xsl:apply-templates" as a single token, because that "word" is more likely to be thought of in the writer's/querier's mind as a single "concept" rather than as "xsl" and "apply-templates".

I see the case here as very similar to German compound words...computertastatur should tokenize as a single "word" on the 1st pass...if you want to break it down into "computer" and "tastatur" on a 2nd-level tokenization, then that's fine. Given a 1st-level tokenization that says "xsl:apply-templates" is a single "word", it is nice to have an algorithm that tells me this one word is probably composed of the two words "xsl" and "apply-templates" (the 2nd of which might also be decomposed into "apply" and "templates"). But I consider that fine-grained tokenization to be a 2nd-level process. The same applies to your "camelCase" examples.
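The two-level scheme described above might be sketched as follows in Python. This is a deliberately flat simplification (the message suggests "xsl:apply-templates" could first split into "xsl" and "apply-templates" before splitting further), and the function names are hypothetical:

```python
import re

def tokenize_outer(text: str) -> list[str]:
    # 1st-level tokenization: split only at whitespace, so compounds
    # like "xsl:apply-templates" survive as single tokens.
    return text.split()

def tokenize_inner(token: str) -> list[str]:
    # 2nd-level tokenization: decompose one outer token at punctuation.
    return [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]

outer = tokenize_outer("use xsl:apply-templates here")
print(outer)                     # ['use', 'xsl:apply-templates', 'here']
print(tokenize_inner(outer[1]))  # ['xsl', 'apply', 'templates']
```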

Have you thought about it in these terms?

pvb


 

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄
 
----- Original Message -----
From: "Mark Davis" <mark.davis@jtcsv.com>
To: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>; "'Biron,Paul V'" <Paul.V.Biron@kp.org>; "Michael Rys" <mrys@microsoft.com>; "Paul Cotton" <pcotton@microsoft.com>
Cc: <w3c-query-operators@w3.org>; "w3c-i18n-ig" <w3c-i18n-ig@w3.org>; "Andy Heninger" <heninger@us.ibm.com>; "Cathy Wissink" <cwissink@microsoft.com>; "Murray Sargent" <murrays@exchange.microsoft.com>; "Michel Suignard" <michelsu@microsoft.com>
Sent: Friday, November 01, 2002 09:13
Subject: Re: FTTF Agenda Item 4, Review of Unicode boundaries

> That is an interesting idea. BTW, the Chicago Manual of Style uses the
> following terms, which seem like reasonable terminology for us to use in
> describing these situations:
>
> open compound (stool pigeon)
> hyphenated compound (ill-favored)
> closed compound (notebook)
>
> What you are saying is that if the user gave {black-bird} as a search option
> (with Whole-Words Only turned on), then it would be possible for an engine
> to also search for {blackbird} and for {black bird}. I'm not sure that is
> always what is desired, however. After all, "blackbird" has a different
> meaning than "black bird". If I wanted to search for specifically
> {black-bird}, I'm not sure I always want the other hits. And if I wanted to
> search for all three, it would be easy to construct a query for them.
>
> I agree that these are interesting cases; in English essentially any
> sequence of words can be linked into a hyphenated compound adjective (CMS
> 6.32-6.42). Years ago, when I was integrating spell-checking into a word
> processor, I hit this case; there are some instances where each component of
> the hyphenated compound is a valid word, and others where only the whole
> compound is. So I ended up first checking the entire compound against the
> dictionary; then each component. If either case worked, I treated it as
> valid.
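The spell-checking strategy Mark describes can be sketched as follows in Python (the mini-dictionary is hypothetical, for illustration only):

```python
def is_valid_compound(word: str, dictionary: set[str]) -> bool:
    # First check the entire compound against the dictionary;
    # failing that, check each hyphen-separated component.
    if word in dictionary:
        return True
    parts = word.split("-")
    return len(parts) > 1 and all(p in dictionary for p in parts)

# Hypothetical mini-dictionary, for illustration only.
words = {"ill-favored", "favored", "black", "bird"}
print(is_valid_compound("ill-favored", words))  # True: whole compound is listed
print(is_valid_compound("black-bird", words))   # True: both components are listed
print(is_valid_compound("ill-bird", words))     # False: 'ill' is not listed
```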
>
> But this whole issue is veering in the direction of root-word analysis,
> which would be a higher level process than TR29 is aiming for. TR29 would
> return three tokens "black", "-", "bird"; it would then be up to a higher
> level process to determine whether that sequence was to be handled in
> special ways or not. If such a process wanted to match these precisely, or
> to also match "black birds", "black birding", "blackened-birds", "melanotic
> avian", or other combinations, it can build upon the lower-level word
> divisions as it sees fit.
>
> Mark
> __________________________________
> http://www.macchiato.com
> ►  “Eppur si muove” ◄
>
> ----- Original Message -----
> From: "Tolkin, Steve" <Steve.Tolkin@FMR.COM>
> To: "'Biron,Paul V'" <Paul.V.Biron@kp.org>; "'Mark Davis'"
> <mark.davis@jtcsv.com>; "Michael Rys" <mrys@microsoft.com>; "Paul Cotton"
> <pcotton@microsoft.com>
> Cc: <w3c-query-operators@w3.org>; "w3c-i18n-ig" <w3c-i18n-ig@w3.org>; "Andy
> Heninger" <heninger@us.ibm.com>; "Cathy Wissink" <cwissink@microsoft.com>;
> "Murray Sargent" <murrays@exchange.microsoft.com>; "Michel Suignard"
> <michelsu@microsoft.com>
> Sent: Friday, November 01, 2002 08:28
> Subject: RE: FTTF Agenda Item 4, Review of Unicode boundaries
>
>
> > Summary:
> > I think that the needs of full text search will be best met by
> > having a "two level" approach to words.
> >
> > (N.B. I kept the subject, so it stays in the same thread,
> > but I am discussing a much bigger topic in this email.)
> >
> > Details:
> > I agree that we should think about "compound words",
> > as these frequently occur even in English.
> > The most common example is when a string is a hyphenated
> > word, e.g. "post-modern".
> > It is my opinion that a human searcher will want
> > a feature that lets a query string of "post-modern" match
> > any of these three strings in the document being searched:
> > "post-modern", "post modern", and "postmodern".
> > (Here I assume that "post-modern" was in a high level search
> > language syntax that is "compiled" down into the actual XQuery
> > full text language.)
> >
> > Other important cases of "compound words" in English
> > occur when the separator character
> > is a slash or backslash (in paths or filenames), or
> > other punctuation characters including period, colon, etc.
> > in library catalog numbers, part numbers, etc., etc.
> > I specifically want good treatment of these non-traditional
> > "words", as I believe technical text is very often searched.
> >
> > Here is a quick overview of my model of text.
> >
> > In my opinion the user model used by full text search
> > should be more sophisticated than a list of words.
> > Instead I think it should be a list of word-collections.
> > (The type of the collection might be set, bag, list, or array,
> > depending on the degree of refinement of search that the system
> > wants to support.  I can probably accept just using a set, which
> > is likely to be simplest even though it gives the least control.)
> >
> > Some words are "outer" words and some are "inner" words.*
> > (* Aside on terminology: I have tried using external and internal,
> > and primary and secondary, etc., but outer and inner seem
> > to work best.  I welcome suggestions for improvement.)
> > In the example above "post-modern" is an outer word
> > that contains the inner words "post" and "modern".
> >
> > Another advantage of this approach is that it provides
> > a principled way of handling alternate lexical forms,
> > e.g. all lowercase, singular, stems, etc.
> > as occurring in the same location as the original word.
> >
> > Therefore I would like Unicode TR-29 to make a distinction
> > between primary word boundaries (e.g. at spaces)
> > vs. secondary word boundaries (e.g. at hyphen).
> >
> >
> > I note that a two-level approach also seems to work well
> > for characters.  An outer character is the "grapheme"
> > that includes all the accents, as per NFC.   An outer character
> > contains a sequence of one or more inner characters,
> > aka codepoints, as per NFD.
> >
> > The two level approach also works well with sentences
> > (which can contain embedded sentences).
> >
> > Note that a two level approach is also already used for XML
> > documents. The document element is the outer element,
> > defining a scope boundary, and all the other elements
> > are inner elements.
> >
> >
> >
> > Hopefully helpfully yours,
> > Steve
> > --
> > Steven Tolkin          steve.tolkin@fmr.com      617-563-0516
> > Fidelity Investments   82 Devonshire St. V8D     Boston MA 02109
> > There is nothing so practical as a good theory.  Comments are by me,
> > not Fidelity Investments, its subsidiaries or affiliates.
> >