Re: Re: Unicode, Cure-all or Kill-all?

From: Martin J Duerst (
Date: Wed Aug 14 1996 - 06:47:09 EDT

Geoff Back wrote:

>With all these discussions about whether characters that have separate
>codepoints and identical glyphs should have been merged, there is one factor
>that no-one has mentioned. Semantic analysis.

Non-experts may get the idea that because CJK ideographs have meanings,
they help semantic analysis. It turns out that this is only marginally so,
if at all: the advantage of having a text in ideographs rather than in
phonetic spelling is probably between 0.1% and 1% for real semantic analysis.
This does not mean that ideographic text is not a big advantage for
human readers; since they do most of the semantic analysis with ease, it
can make a big difference for them.

>Taking the example below, if I am performing semantic analysis on raw Unicode
>text for, say, spell checking, there is a clear and vital difference between
>Latin "A" and Greek "A". This is where the Unicode definition of a character
>comes into its own.

I don't see the point here, for several reasons:
- A Latin or Greek word on average will contain several letters that don't
        appear in the other alphabet, and therefore distinction between
        Latin and Greek words will not be a problem.
- For other pairs of languages, e.g. English and French, distinguishing
        Latin and Greek "A" doesn't make any difference.
- Semantic analysis could indeed improve spell-checking somewhat
        (e.g. to find places where "from" should have been "form"),
        but I don't know of any case where this is actually done, or how it
        would relate significantly to the distinction between Latin and Greek "A".
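
To make the first point concrete, here is a minimal Python sketch. It is an illustration only, not something from the original discussion. It assumes the script of a letter can be guessed from the first word of its Unicode character name, which holds for basic Latin and Greek letters but is not a general script property lookup. The helper names (script_of, scripts_in, distinctive_greek_letters) and the LOOKALIKE_GREEK_CAPITALS set are hypothetical, and the set is a hand-picked list of Greek capitals whose glyphs usually resemble Latin capitals.

```python
import unicodedata

# Assumption: a hand-picked, illustrative set of Greek capitals whose glyphs
# usually look like Latin capitals (Alpha, Beta, Epsilon, Zeta, Eta, Iota,
# Kappa, Mu, Nu, Omicron, Rho, Tau, Upsilon, Chi).
LOOKALIKE_GREEK_CAPITALS = set(
    "\u0391\u0392\u0395\u0396\u0397\u0399\u039a"
    "\u039c\u039d\u039f\u03a1\u03a4\u03a5\u03a7"
)


def script_of(ch):
    """Rough script guess: the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(word):
    """Set of guessed scripts for the alphabetic characters of a word."""
    return {script_of(ch) for ch in word if ch.isalpha()}


def distinctive_greek_letters(word):
    """Greek letters in the word that have no Latin lookalike in our set."""
    return [
        ch for ch in word
        if script_of(ch) == "GREEK" and ch not in LOOKALIKE_GREEK_CAPITALS
    ]


latin_a = "\u0041"      # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA

# Separate code points, even though the glyphs usually look identical.
print(latin_a == greek_alpha)         # False
print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(greek_alpha))  # GREEK CAPITAL LETTER ALPHA

greek_word = "\u0391\u039b\u03a6\u0391"  # Alpha, Lamda, Phi, Alpha
print(scripts_in("ALPHA"))               # {'LATIN'}
print(scripts_in(greek_word))            # {'GREEK'}

# Even if Alpha shared a code point with Latin "A", the two middle letters
# of this word would still mark it as Greek.
print(len(distinctive_greek_letters(greek_word)))  # 2
```

In this sketch the Greek word is recognizable from its second and third letters alone, which is the situation the first point describes.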

Regards, Martin.
