Re: That UTF-8 Rant (was Unicode in source)

From: Robert Brady (robert@events.susu.soton.ac.uk)
Date: Thu Jul 22 1999 - 14:15:50 EDT


> o The characters are all 16-bits in the BMP, in terms of processing (yes,
> yes, combining marks take more than one character to encode... but for
> processing purposes everything is the same width. Yes, there are higher
> planes of existence and these cannot be ignored...).

You seem to think that the other planes don't matter. Presumably you
therefore ignore surrogates. You shouldn't do this.

> o There is less text expansion for non-Latin languages.

This is true.

> o There are programmatic ways of handling Unicode text via TCHAR that
> reduces the impact on code. If you don't unthread UTF-8 to UTF-16, text
> processing becomes somewhat uglier.

Um, what's TCHAR when it's at home?

> o For languages other than Western European, the text expansion is much
> smaller than for UTF-8, so storage is conserved.

You said that already.

> If you are going to write a text processing application, why would you make
> UTF-8 the default internally, when UTF-16 is so much easier to code?
> Programmer's editors are, of course, text processing applications, and while
> they need to handle UTF-8 (read and write), internally UTF-16 is going to be
> much cleaner. Are you arguing to use UTF-8 because it makes the lexical
> analyser you've already written able to sorta-kinda process Unicode? I think
> the other messages on this thread clearly show why this is a potential
> problem if we allow Unicode into our identifiers. (It's not a problem if you
> confine Unicode to string literals). Write a lexer that can handle
> UTF-16/UCS-2. It's a lot easier to preprocess all of your text to that
> encoding before lexing it than it is to have multi-octet data.

Again, this all assumes surrogates are ignored. They shouldn't be. I
really hope you aren't ignoring them, because that is a Bad Thing to do.
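To put it concretely: a lexer that treats every UTF-16 code unit as a
whole character gets every character outside the BMP wrong. A minimal
sketch (Python, purely illustrative; the function name is mine) of
walking UTF-16 code units and pairing surrogates correctly:

```python
def decode_utf16(units):
    """Decode a sequence of 16-bit UTF-16 code units into code points,
    combining high (0xD800-0xDBFF) / low (0xDC00-0xDFFF) surrogate pairs."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) \
                and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Each half of the pair contributes 10 bits above U+10000.
            out.append(0x10000
                       + ((u - 0xD800) << 10)
                       + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out
```

For example, U+1D11B is the pair 0xD834 0xDD1B on the wire; a unit-at-a-time
lexer would see two bogus "characters" instead of one. (This sketch passes an
unpaired surrogate through unchanged; a real implementation would want to
treat that as an error.)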

-- 
Robert



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT