Re: locale-*independent* vi editor supporting UTF-8

From: Jungshik Shin (jshin@pantheon.yale.edu)
Date: Tue Dec 08 1998 - 14:15:39 EST


On Mon, 7 Dec 1998, Otto Stolz wrote:

> On 1998-12-03 at 7:46 h, odonnell@zk3.dec.com wrote:
> > According to the XPG4 vi man page, the current
> > locale controls many aspects of vi's behavior, including the
> > way strings are parsed into characters,

  There's absolutely no reason not to make use of UTF-8 support from
the underlying OS if available. However, a vi written that way is NOT
(yet) portable across platforms. My original question was about a
*quick and dirty* solution: a UTF-8 enabled vi working without any
help from the OS when UTF-8 support is not readily available, which is
still the case in most Unix and Unix-like OSes. So making a UTF-8
enabled vi independent of the OS locale support is not a long-term
solution for wider deployment of Unicode (and UTF-8); the long-term
solution is adding ground-up support for Unicode/UTF-8 to the OSes
themselves. It is a short-term patch to enable as many people as
possible, whatever OS they may use, to edit UTF-8 documents with one
of the most popular editors on Unix.

> UTF-8 (cf. <http://czyborra.com/utf/#UTF-8>) uses 1 through 3 bytes per BMP
> character (1 through 4 bytes per Unicode character). In order to "parse
> strings into characters", the processing program must undo the UTF-8

  Sure, vi should know about UTF-8. How it should get the necessary
information depends on the degree of UTF-8 support offered by the OS
it's intended to run under. If the OS support is NIL, it should do it
by itself (ucdata would certainly be of help in this respect).
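
  For concreteness, here is a minimal sketch of such self-contained
parsing. None of this is from any existing vi clone; the function name
is mine, and it handles only the BMP forms for brevity:

    #include <stddef.h>

    /* Decode one UTF-8 sequence starting at s, storing the code point
     * in *pc.  Returns the number of bytes consumed, or 0 on a
     * malformed sequence.  No locale or OS support is needed: all the
     * information is in the bit pattern of the lead byte. */
    size_t utf8_decode(const unsigned char *s, unsigned long *pc)
    {
        if (s[0] < 0x80) {                  /* 0xxxxxxx: US-ASCII */
            *pc = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx 10xxxxxx  */
            if ((s[1] & 0xC0) != 0x80)
                return 0;
            *pc = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx + 2 cont. */
            if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
                return 0;
            *pc = ((unsigned long)(s[0] & 0x0F) << 12)
                | ((unsigned long)(s[1] & 0x3F) << 6)
                | (s[2] & 0x3F);
            return 3;
        }
        return 0;   /* 4-byte forms (beyond the BMP) omitted here */
    }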

> Vi, as any other program, has to know about this encoding, in order to
> perform correctly; a classical, 8-bit based, Vi implementation would not
> even get the cursor position right, with UTF-8 encoded data. At the very
> least, Vi will have to take the UTF-8 mechanism into account when counting
> characters and calculating cursor movements.

   This paragraph (with which I fully agree) does not necessarily
lead to your conclusion in the following sentence.

> Hence, I cannot understand how
> > It's easy to have a vi that processes UTF-8-encoded data.

   I'm not sure how easy it would be to write a UTF-8 enabled vi
WITHOUT the HELP of the locale support offered by the underlying OS.
(If the OS has locale support for UTF-8, as some do for other
multibyte encodings, it's really easy to write a vi for UTF-8, since
UTF-8 has nothing more mysterious about it than EUC-JP, EUC-KR, Big5,
Shift_JIS, EUC-TW, EUC-CN, etc.) However, as I wrote in the first (or
second) message of this thread, there ARE several implementations of
vi (clones) for the multibyte encodings of CJK (EUC-JP, EUC-KR, Big5,
Shift_JIS, ISO-2022-JP, JOHAB, etc.) that work whatever locale they
run under. (Of course, they need a terminal emulator supporting those
encodings, and terminal emulators satisfying that requirement exist.)
Do you believe all those CJK people outside their home countries have
lived without a vi supporting the encodings specific to their
languages? (Outside East Asia, most Unix system admins don't bother to
install CJK locales even when they're available free from the OS
vendor, so a locale-*dependent* vi for those encodings is of NO use.)
They work without any help from the OS because they implement
multibyte-encoding handling without relying on any *w* string/char
functions. Given this, and the fact that UTF-8 is not much different
from those multibyte encodings (if we can put aside difficult issues
like BIDI and combining characters for the moment), it should be
possible, in principle, to write a vi not depending on the OS locale
support, as long as we have a terminal emulator for UTF-8.
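
  To make that concrete, here is a rough sketch (the function names
are mine) of cursor movement and character counting done entirely
without *w* functions, along the same lines as what the CJK vi clones
do with EUC lead bytes. It relies only on the fact that UTF-8
continuation bytes are always of the form 10xxxxxx:

    #include <stddef.h>

    /* A 10xxxxxx byte is a continuation byte, never the start of a
     * character -- this is what makes stepping through UTF-8 no
     * harder than stepping through EUC-JP or EUC-KR. */
    #define UTF8_CONT(b)  (((unsigned char)(b) & 0xC0) == 0x80)

    /* Cursor right: skip the lead byte, then any continuation bytes. */
    const char *utf8_next(const char *p)
    {
        p++;
        while (UTF8_CONT(*p))
            p++;
        return p;
    }

    /* Cursor left: back up over continuation bytes to a lead byte.
     * (Assumes p is not already at the start of the buffer.) */
    const char *utf8_prev(const char *p)
    {
        p--;
        while (UTF8_CONT(*p))
            p--;
        return p;
    }

    /* Character count of a NUL-terminated line (e.g. for column
     * numbering): simply don't count continuation bytes. */
    size_t utf8_strlen(const char *s)
    {
        size_t n = 0;
        for (; *s != '\0'; s++)
            if (!UTF8_CONT(*s))
                n++;
        return n;
    }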

  One remaining issue to resolve is which characters are half-width
and which are full-width. In the CJK encodings it's clear-cut, but in
UTF-8 it's not. I guess it would be all right as long as vi and the
terminal emulator can agree with each other on character widths.
Perhaps the following rules of thumb might work (a rough sketch in
code follows the list):

  1. All characters "derived" from ISO-8859-x and other single-byte
     character sets (and related ones) are half-width, even if they're
     also defined in a CJK source standard
     (e.g. the copyright sign U+00A9 would be half-width
      even though it's full-width in a CJK environment)
  2. CJKV UniHan and (pre-composed) Korean Hangul are full-width
  3. All characters in the CJK compatibility blocks are full-width
   ..... and some more
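
  Here is a rough sketch of how those rules might be coded on top of
the decoder above. The block boundaries are my assumptions from the
Unicode 2.x charts and would need to be checked against the real
tables (ucdata could supply them); this is not a complete list:

    /* Display width of a code point: 2 for full-width, 1 otherwise.
     * The ranges below are assumptions to be verified, not
     * authoritative data.  Rule 1 falls out implicitly: everything
     * below U+3000, including all of Latin-1, stays half-width. */
    int ucs_width(unsigned long c)
    {
        if (c >= 0x3000 && c <= 0x303F) return 2; /* CJK symbols     */
        if (c >= 0x3040 && c <= 0x30FF) return 2; /* kana            */
        if (c >= 0x4E00 && c <= 0x9FFF) return 2; /* rule 2: UniHan  */
        if (c >= 0xAC00 && c <= 0xD7A3) return 2; /* rule 2: Hangul  */
        if (c >= 0xF900 && c <= 0xFAFF) return 2; /* rule 3: compat  */
        if (c >= 0xFF00 && c <= 0xFF60) return 2; /* fullwidth forms */
        return 1;                                 /* rule 1, etc.    */
    }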
  

   Of course, too much can't be expected from such an implementation.
For instance, regular expressions wouldn't cover much outside the
vanilla US-ASCII range. (Even then, byte-oriented searching for ASCII
patterns remains safe, because no byte of a UTF-8 multibyte sequence
falls in the US-ASCII range.) However, such basic features of vi as
character/word/line counting (in various commands) and cursor movement
could be supported.

> In order to process data in various encodings, such as ISO 8859-1, UTF-8,
> and Unicode (UTF-16), a programm has to know about the encoding of the
> actual data. Hence, I cannot understand how a program, such as Vi, could
> work with a locale that does not cover the encoding.
> Please, explain.

  I hope this time I succeeded in conveying what I meant.

     Jungshik Shin


