Re: unicode on Linux

From: Edward H. Trager (ehtrager@umich.edu)
Date: Tue Oct 21 2003 - 09:32:28 CST


On Tuesday 2003.10.21 14:43:43 +0200, Stephane Bortzmeyer wrote:
> On Mon, Oct 20, 2003 at 10:14:22PM +0200,
> Stefan Persson <alsjebegrijptwatikbedoel@yahoo.se> wrote
> a message of 23 lines which said:
>
> > >Just wondering if anybody knows how unicode is on Linux?
> > >
> > Very good support.
>
> Very optimistic.
>
> Kernel
> *****
>
> 1) File names in Unicode: no (well, the Linux kernel is 8-bit clean
> so you can always encode in UTF-8, but the kernel does not do any
> normalization and the applications do not expect UTF-8, for instance
> ls sorts alphabetically but does not know Unicode sorting).
>

I think there can be big debates about whether the Linux kernel (or any *nix
kernel, for that matter) has any business normalizing file names. Personally,
I think Unicode normalization is not the kernel's business; it is better left
to userland applications.
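
To make the userland argument concrete, here is a minimal Python sketch of an
application normalizing names itself. The helper name normalize_name and the
sample file names are made up for illustration; only unicodedata.normalize is
a real library call.

```python
import unicodedata

def normalize_name(name: str, form: str = "NFC") -> str:
    """Return the file name in the requested Unicode normalization form."""
    return unicodedata.normalize(form, name)

composed = "caf\u00e9.txt"     # e-acute as a single code point (U+00E9)
decomposed = "cafe\u0301.txt"  # 'e' followed by combining acute (U+0301)

# An 8-bit clean kernel sees two different byte strings, so these would be
# two different files even though they look identical on screen.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))

# An application that normalizes before comparing treats them as one name.
print(normalize_name(composed) == normalize_name(decomposed))
```

Whether NFC or NFD is the right form is a policy choice for the application,
which is one more reason to keep it out of the kernel.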

Are you sure about ls? ls should sort UTF-8-encoded file names in raw Unicode order,
n'est-ce pas? Of course, that may not be what one wants! Take Chinese for example:
there are many different methods for sorting Chinese used in Chinese dictionaries
(phonetic, radical+stroke count, four corner method, ... ). The order of the unified
Hanzi/Kanji in Unicode used the KangXi (radical and stroke-count based) dictionary as a
primary basis, and the Dai Kanwa Ziten as a secondary basis. So the result is a hybrid
Chinese plus Japanese ordering. Plus, the CJK Joint Research Group had to deal with the
placement of all of the simplified Chinese characters that were not listed in the
historical KangXi dictionary (originally compiled between 1710 and 1716). It is nice that
Unicode in some sense preserves the great tradition first established by the KangXi ZiDian,
but that sort order may not be what any given modern Chinese, Japanese, or other user needs
or wants for a particular purpose. Similar stories exist for other scripts and languages.
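
As a small illustration of what "raw Unicode order" means, the Python sketch
below sorts a handful of made-up names two ways: by code point, and by their
UTF-8 bytes. UTF-8 was designed so that the two orders agree, which is why a
plain byte-wise sort of UTF-8 file names yields code point order. It says
nothing about what a particular ls build does under a particular locale.

```python
names = ["zebra", "Zebra", "apple", "\u00e9clair", "\u4e2d\u6587", "\u65e5\u672c"]

# Python compares str objects code point by code point.
by_code_point = sorted(names)

# Sorting on the encoded bytes is what a byte-wise comparison would do.
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)
print(by_code_point)
```

Note that "Zebra" sorts before "apple" here and the accented name lands after
"zebra", which is exactly the kind of result a human reader may not want. A
language-aware order needs a collation step on top (in Python, locale.strxfrm
is the standard-library hook, and its result depends on the locales installed).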

> 2) User names: worse since utilities to create an account refuse
> UTF-8.
>
> Applications
> ************
>
> 3) grep: no Unicode regexp

What about ICU's regexp package?
(http://oss.software.ibm.com/icu/userguide/regexp.html)
You should be able to use ICU on *any* platform.
Linux does not yet have a Unicode grep,
and to my knowledge Windows does not yet have grep at all ...

Most of my pattern searching and string manipulation needs
-- which include searching through documents and data encoded in UTF-8 --
are fully met using egrep and Perl (I happen to use Linux, but of course
Perl is available on every platform). So it is clear that everything
depends on evaluating one's needs, and then figuring out which software
will meet those needs. There is now enough Unicode-aware software on Linux
to meet many people's needs. See http://eyegene.ophthy.med.umich.edu/unicode/.
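
For anyone wondering what difference a "Unicode regexp" makes in practice,
here is a rough Python sketch (Python rather than Perl or ICU, simply to keep
it short; the sample text is made up). It runs the same pattern over the raw
UTF-8 bytes and over the decoded characters.

```python
import re

text = "na\u00efve caf\u00e9"   # two words, each with one non-ASCII letter
raw = text.encode("utf-8")

# Byte-oriented matching: '.' consumes one byte, but U+00E9 occupies two
# bytes in UTF-8, so "caf" + one unit + end of input does not match.
byte_match = re.search(rb"caf.$", raw)

# Character-oriented matching: decode first, then '.' consumes one code point.
char_match = re.search(r"caf.$", raw.decode("utf-8"))

# \w only knows ASCII letters on bytes, but Unicode letters on str.
byte_words = re.findall(rb"\w+", raw)
char_words = re.findall(r"\w+", text)

print(byte_match, char_match.group(0))
print(byte_words, char_words)
```

Simple literal searches work fine on UTF-8 bytes, which is why egrep covers
many needs; it is wildcards and character classes where the tool has to know
about characters.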

>
> 4) xterm (or similar virtual terminals): No BiDi support at all

Use mlterm instead. It has BiDi support and support for complex text
layout as required for Arabic, Indic, and Indic-derived scripts. See
http://eyegene.ophthy.med.umich.edu/unicode/#termemulator .

> 5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
> that has Unicode character semantics (back-character should move one
> character, not one byte)
>
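
The "one character, not one byte" requirement is not hard to state in code.
The Python sketch below is only an illustration of the idea, not how any
particular shell does it; the function name prev_char_start is made up, and
it assumes the buffer is valid UTF-8 with the cursor on a character boundary.

```python
def prev_char_start(buf: bytes, pos: int) -> int:
    """Return the byte offset where the character before `pos` begins."""
    if pos <= 0:
        return 0
    pos -= 1
    # UTF-8 continuation bytes look like 10xxxxxx; step back over them
    # until we reach the lead byte of the character.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

line = "ab\u00e9\u4e2d".encode("utf-8")   # 1 + 1 + 2 + 3 bytes
cursor = len(line)

positions = []
while cursor > 0:
    cursor = prev_char_start(line, cursor)
    positions.append(cursor)

print(positions)
```

This moves by code point only. A base letter followed by a combining mark
would still take two key presses, so a full solution also needs some notion
of combining sequences.
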
> 6) databases: I'm not aware of a free DBMS which has support for
> Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).
>

I thought both Postgres and MySQL either already have this or are
working on it?

> 7) Serious word processing: LaTeX has only very minimum Unicode

Many would argue that OpenOffice.org 1.1 needs to be included in the
category of "serious word processing", and it has good
Unicode support.

>
> Also, many applications (exmh, emacs) are ten times slower when
> running in UTF-8 mode.
>

exmh is written in Tcl/Tk: isn't everything written in Tcl/Tk sssllowww?
When was the last time that it really mattered how fast your
editor worked? If emacs is slow, use vi ;-) ... oops, I forgot
this might provoke some people (it's a joke)!
 
> At the present time, using Unicode on Unix is an act of faith.

That is not an accurate statement.

Are you talking about proprietary Unixes or Linux? I thought the
questions were about support on Linux. With regard to Unicode support on
Linux, I completely disagree with you. I use Unicode for serious
work on Linux every day.

Clearly it really depends on what you want to do. And that is the case
on other OSes as well.

>
> > Default charset for recent versions of some popular distributions.
>
> Yes, RedHat changed the default charset to Unicode without thinking
> that text files were no longer readable.
>
> See:
>
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
> ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
> http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html
>


