Re: unicode on Linux

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat Oct 25 2003 - 07:16:56 CST


Stephane Bortzmeyer wrote:

> Kernel
> 1) File names in Unicode: no (well, the Linux kernel is 8-bit clean
> so you can always encode in UTF-8, but the kernel does not do any
> normalization)

  As others have written, I don't think the kernel has any business
doing normalization (although on Mac OS X, the kernel apparently does).

> the applications do not expect UTF-8, for instance
> ls sorts alphabetically but does not know Unicode sorting).

  Does 'ls' sort filenames correctly when they're in ISO-8859-1, either?
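  (Collation is really a property of the locale, not of 'ls' itself. As
a minimal sketch of the difference, in Python and assuming a fr_FR.UTF-8
locale happens to be installed on the system:

    import locale

    locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")  # assumed installed
    names = ["zèbre", "Zoo", "étage", "eau"]
    print(sorted(names))                      # naive code-point order
    print(sorted(names, key=locale.strxfrm))  # locale-aware collation

The same applies whether the names are in UTF-8 or ISO-8859-1; a tool
that just compares bytes gets both wrong.)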

> 2) User names: worse since the utilities to create an account refuse
> UTF-8.

  Yeah, this should be fixed.

> Applications
>
> 3) grep: no Unicode regexp

  I agree that grep and many other text utilities need to be updated to
honor the locale (LC_COLLATE, LC_CTYPE and others). With glibc 2.2.x or
later and gnulib, it shouldn't be as hard as it used to be. In addition,
you always have Perl and Python to turn to (both support Unicode very
well). Note also that I wrote 'honoring the locale' rather than
'supporting UTF-8' to emphasize that it's not just UTF-8 but also legacy
character encodings that grep and the other GNU textutils used on Linux
fail to handle.
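  For what it's worth, a locale-/Unicode-aware 'grep' is only a few
lines in Python. Just a sketch (the pattern and file name are made up
for illustration, and the input file is assumed to be in UTF-8):

    import re
    import sys

    pattern = re.compile(sys.argv[1])          # e.g. a pattern using \w, \b
    with open(sys.argv[2], encoding="utf-8") as f:   # assumed UTF-8 input
        for line in f:
            if pattern.search(line):           # \w etc. match Unicode chars
                print(line, end="")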

> 4) xterm (or similar virtual terminals): No BiDi support at all

   mlterm does. It even supports Indic scripts. (xterm does support the
Thai and Korean scripts, though.) Do you know of any terminal emulator
running on other platforms that does BiDi well?

> 5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
> that has Unicode character semantics (back-character should move one
> character, not one byte)

  A recent version of bash (to be precise, the GNU readline library it
uses) has no problem handling UTF-8 (although it does not do well with
combining character sequences; that is, it has no notion of grapheme
clusters).
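  To illustrate what I mean by grapheme clusters, here is a small Python
sketch (my own example, not anything readline itself does): the
precomposed and the combining forms of the same accented letter display
as one character, but they differ in the number of code points an editor
has to move over.

    import unicodedata

    precomposed = "\u00e9"    # 'é' as a single code point
    combining = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

    print(len(precomposed))   # 1 code point
    print(len(combining))     # 2 code points, one grapheme cluster on screen
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True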

> 6) databases: I'm not aware of a free DBMS which has support for
> Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).

   Why is the OS to blame for there being no FREE DBMS that supports
Unicode collation and regular expressions? Needless to say, there
are commercial DBMSes that do both and run on Linux.

> 7) Serious word processing: LaTeX has only very minimal Unicode support

  Well, Linux distributions come not only with LaTeX/TeX but also with
Lambda/Omega, their Unicode cousins. OpenType font support in
Omega/Lambda is not there yet, but Indic scripts and other complex
scripts (e.g. the Korean script) can be typeset with Omega/Lambda.
Anyway, LaTeX/Lambda are not for word processing. If you want a word
processor, try OpenOffice/StarOffice, AbiWord, KWrite, and so forth,
which support Unicode well.

> Also, many applications (exmh, emacs) are ten times slower when
> running in UTF-8 mode.

  Emacs' adoption of Unicode has been moving frustratingly slowly, and
its performance may be worse in UTF-8 mode than otherwise (actually,
there are a couple of different UTF-8 implementations for Emacs and I
don't know which one you tried), but Vim is not slower. The reason Emacs
is that much slower likely has to do with the fact that its UTF-8
support is retrofitted onto the ISO-2022-based infrastructure of MULE.
Other applications on Linux do NOT have to carry that baggage, so they
are no slower in UTF-8 mode than in a legacy encoding. Actually, they
should be faster in UTF-8, because most modern toolkits/applications for
Linux are based on Unicode, so there is no codeset-conversion overhead
(if UTF-8 is the internal representation, as in GTK+) or very little (if
UTF-16 is used internally, as in Qt). Please don't extrapolate from just
a couple of bad examples.
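  As a rough illustration of that last point (a sketch in Python, not
how GTK+ or Qt actually work internally): when input arrives as UTF-8, a
UTF-8-internal toolkit can take it as-is, while a UTF-16-internal one
pays one transcoding step.

    # Input arrives as UTF-8 bytes, the usual case in a UTF-8 locale.
    utf8_input = "Grüße aus Köln".encode("utf-8")

    # UTF-8 internal representation: already in the right form.
    internal_utf8 = utf8_input

    # UTF-16 internal representation: one decode/re-encode step.
    internal_utf16 = utf8_input.decode("utf-8").encode("utf-16-le")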

> At the present time, using Unicode on Unix is an act of faith.

  Well, I thought this was 2003. You write as if it were 2000. You sound
like a one-time 'convert' who lost their faith a long time ago and has
never come back to see how much has changed since.

Moreover, because you used 'Unix' instead of 'Linux' in that sentence,
Sun and IBM engineers who worked on UTF-8 locale support in Solaris and
AIX may take offense at your remark. I can't say much about AIX except
that it has supported UTF-8 locales for as long as Solaris has. As for
Solaris, Solaris 7 (released in the late 1990s) and later don't even
have some of the remaining problems Linux still has (i.e.
grep/sed/ls/sort and other textutils not honoring the locale in their
handling of text streams).

>> Default charset for recent versions of some popular distributions.
>
>
> Yes, RedHat changed the default charset to Unicode without thinking
> that text files were no longer readable.

  Unreadable? What is iconv(1) for? Perhaps RH should have included a
nice GUI migration tool (as part of the RH 8/9 installation disk) to
let clueless end users (Mom and Pop) convert all their text files in
legacy encodings to UTF-8, along with a similar tool for filename
conversion.
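  Such a migration tool doesn't have to be fancy. A minimal sketch in
Python (the file pattern and the source encoding are assumptions for
illustration, not part of any actual RH tool):

    import glob

    SRC_ENCODING = "iso-8859-1"        # assumed legacy encoding

    for name in glob.glob("*.txt"):    # hypothetical set of text files
        with open(name, "rb") as f:
            text = f.read().decode(SRC_ENCODING)
        with open(name, "wb") as f:
            f.write(text.encode("utf-8"))   # rewrite the file as UTF-8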

   I'm not saying that using Unicode (mostly in the form of UTF-8) on
Linux is as seamless as I wish it were (there are a number of issues I
want to fix or to see fixed that you didn't mention, possibly because
you wrote mainly from a Western European point of view), but it's
certainly not as difficult as you painted it to be.

  Jungshik


