Contribution and Limits of the Use of Unicode in Approximate Pattern-Matching
Yves Lepage - ATR
Example-based methods in machine translation typically work as follows: knowing that "he is young" is translated by "wakai" in Japanese, we can translate "he is not young" (only "not" is added) by starting from the known translation and constructing "wakakunai" (only "i" is replaced by "kunai"). This requires retrieving similar sentences from collections of already translated sentences, which is what approximate pattern-matching is used for.
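To make the retrieval step concrete, here is a minimal sketch in Python. The abstract does not say which approximate pattern-matching algorithm is used, so plain character-level edit distance (Levenshtein) stands in for it here; the function names `edit_distance` and `closest` and the small `memory` list are hypothetical and only serve the illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(query: str, memory: Sequence[str]) -> str:
    """Return the stored sentence with the smallest edit distance to query."""
    return min(memory, key=lambda s: edit_distance(query, s))


# Hypothetical collection of already translated source sentences.
memory = ["he is young", "she is a doctor", "it is raining"]

print(closest("he is not young", memory))
print(edit_distance("wakai", "wakakunai"))
```

Because Python strings are sequences of Unicode code points, the same function applies unchanged to English and Japanese text, whereas a byte-oriented implementation would have to know whether it is reading ASCII or EUC.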
Machine translation involves several languages, and different languages may use different character sets, usually encoded in different ways. As a side effect, the set of punctuation marks also differs from language to language. Until now, we had two separate implementations of our approximate pattern-matching algorithm, one for each of the two languages we work with: English (ASCII) and Japanese (EUC). For Japanese, this had the disadvantage that the texts to be searched had to be consistently encoded in EUC. However, Japanese writers tend to mix several character sets, and several sets of punctuation marks, within the same Japanese text.
Shifting to Unicode brought two advantages. Firstly, the problem of several character sets being used within one language is eliminated. Secondly, the notion of punctuation becomes independent of the language. Punctuation stands somewhere between the logical and the physical structure of texts. However, while punctuation helps approximate pattern-matching determine word boundaries in a language like English, it does not do so in a language such as Japanese. The answer to this problem lies outside the scope of character sets: it is a problem for natural language processing.
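The two advantages can be illustrated with Python's standard `unicodedata` module. This is only a sketch under assumptions: the abstract does not describe how the implementation folds character variants or recognizes punctuation, so NFKC normalization and the Unicode general category are used here as plausible stand-ins, and the helper names `normalize`, `is_punctuation` and `segments` are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold compatibility variants (e.g. full-width Latin letters,
    half-width katakana) into one canonical form using NFKC."""
    return unicodedata.normalize("NFKC", text)


def is_punctuation(ch: str) -> bool:
    """True if the character's Unicode general category is a
    punctuation class (Pc, Pd, Ps, Pe, Pi, Pf or Po)."""
    return unicodedata.category(ch).startswith("P")


def segments(text: str) -> list:
    """Split normalized text at punctuation marks, whatever the script."""
    out, current = [], []
    for ch in normalize(text):
        if is_punctuation(ch):
            if current:
                out.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        out.append("".join(current))
    return out


print(segments("Hello, world!"))
print(segments("若い。若くない、"))
```

The same test recognizes the English comma and the Japanese ideographic full stop as punctuation, which is the language independence described above. It also shows the limit: the Japanese segments come out as unbroken strings such as "若くない", because nothing at the character level marks word boundaries inside them.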
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS).
21 February 2002, Webmaster