> John Cowan wrote:
> > In addition, in some applications those processing inefficiencies are
> > not present, thanks to the self-segregating nature of UTF-8. For
> > example, the Plan 9 "fgrep" program (which searches a stream of text
> > for the presence of one or more of a list of strings) need never convert
> > to UCS format at all; the strings are UTF-8 and so is the text, and
> > in fact the program looks the same as the corresponding 8-bit program.
> This is not completely true, fgrep to be Unicode compliant must
> deal correctly with combining characters. e.g.
> è ( <latin small letter "e" with grave "`" U00E9> ) is exactly
> equal to
> <latin small letter e U0065> <modifier letter low grave accent ' U02CE>
> So, grep should match <U00E9> with <U0065><U02CE> to be truly
> Unicode compliant.
Actually, it is U+0065 LATIN SMALL LETTER E + U+0300 COMBINING GRAVE ACCENT
(and the precomposed character is U+00E8, not U+00E9),
but the point is well taken.
This issue needs to be understood in the context of the development
of well-defined normalization forms for Unicode, however. See:
An fgrep would work correctly, as John Cowan states, on Unicode data
in the UTF-8 encoding form, as long as the data itself conformed
to one of the normalization forms (whether decomposed or composed).
It is only when arbitrary mixes of decomposed and composed representations
are present that things become more difficult for implementations.
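
To make the difference concrete, here is a minimal sketch in Python. It is
only an illustration, not the Plan 9 code: utf8_contains is a hypothetical
helper that mimics a byte-wise, 8-bit-style search, and the sample word is
my own. It uses the standard unicodedata module for normalization.

```python
import unicodedata

def utf8_contains(text: str, pattern: str) -> bool:
    """Byte-wise substring search over UTF-8, as an 8-bit fgrep would do.
    (Hypothetical helper, for illustration only.)"""
    return pattern.encode("utf-8") in text.encode("utf-8")

composed = "\u00e8"      # U+00E8 LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"   # U+0065 + U+0300 COMBINING GRAVE ACCENT

text = "cr" + composed + "me"   # text stored in composed form

# Mixed forms: the byte sequences differ, so a plain byte search misses.
mixed_hit = utf8_contains(text, decomposed)

# Same form on both sides (either NFC or NFD): the byte search succeeds.
nfc_hit = utf8_contains(unicodedata.normalize("NFC", text),
                        unicodedata.normalize("NFC", decomposed))
nfd_hit = utf8_contains(unicodedata.normalize("NFD", text),
                        unicodedata.normalize("NFD", decomposed))
```

Here mixed_hit is False while nfc_hit and nfd_hit are both True, which is
the case distinction described above.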
The specified behavior of an fgrep could be defined for a particular
normalization form of Unicode data and thereby be compliant with
the Unicode Standard, without having to have runtime normalization
built into the fgrep process itself.
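
As a rough sketch of that division of labor, assuming NFC is the agreed
form: the search routine below does no normalization at all, and the check
that the inputs are NFC lives outside it. The names fgrep_nfc and is_nfc
are hypothetical, chosen for this example only.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    """True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s

def fgrep_nfc(lines, patterns):
    """Return the lines containing any pattern, using byte-wise UTF-8 search.
    Assumption: lines and patterns are already NFC; nothing is normalized here."""
    needles = [p.encode("utf-8") for p in patterns]
    return [line for line in lines
            if any(n in line.encode("utf-8") for n in needles)]

# The producer of the data (not the search) is responsible for the form.
lines = ["cr\u00e8me", "creme"]
patterns = [unicodedata.normalize("NFC", "e\u0300")]  # normalized once, up front

assert all(is_nfc(s) for s in lines + patterns)
matches = fgrep_nfc(lines, patterns)
```

With these inputs, matches contains only the first line. A pattern left in
decomposed form would violate the stated precondition and would not match.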