Re: Nicest UTF

From: D. Starner (shalesller@writeme.com)
Date: Wed Dec 08 2004 - 15:51:47 CST

    "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl> writes:
    > "D. Starner" <shalesller@writeme.com> writes:
    >
    > > You could hide combining characters, which would be extremely useful if we were just using Latin
    > > and Cyrillic scripts.
    >
    > It would need a separate API for examining the contents of a combining
    > character. You can't avoid the sequence of code points completely.

    Not a separate API; a function that takes a character and returns an array
    of integers, something like the sketch below.
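    For instance (a rough Python sketch of what I mean; the function names and
    the simplified base-plus-combining-marks clustering are placeholders of
    mine, not anything standardized, and real grapheme segmentation handles
    more cases):

        import unicodedata

        def graphemes(s):
            # Simplified clustering: a base code point plus any trailing
            # combining marks. Real Unicode text-boundary rules cover more
            # cases (Hangul jamo, etc.); this is only a sketch.
            cluster = ""
            for cp in s:
                if cluster and unicodedata.combining(cp) == 0:
                    yield cluster
                    cluster = ""
                cluster += cp
            if cluster:
                yield cluster

        def code_points(ch):
            # The proposed accessor: one "character" in, an array of
            # integers out.
            return [ord(cp) for cp in ch]

        for ch in graphemes("ya\u0301"):
            print(ch, code_points(ch))   # y [121], then á [97, 769]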

    > It would yield to surprising semantics: for example if you concatenate
    > a string with N+1 possible positions of an iterator with a string with
    > M+1 positions, you don't necessarily get a string with N+M+1 positions
    > because there can be combining characters at the border.

    The semantics there are surprising, but that's true no matter what you
    do. An NFC string + an NFC string may not be NFC, in which case the
    resulting text doesn't have N+M graphemes. Unless you're explicitly
    adding a combining character, a combining character should never start a
    string. This could be fixed in several ways, including by inserting a
    dummy character to hold the combining character and "normalizing" the
    string by removing the dummy characters. That would, for the most part,
    only hurt pathological cases.
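
    To make the border effect concrete (a minimal Python sketch, using only
    the standard unicodedata module):

        import unicodedata

        s1 = unicodedata.normalize("NFC", "a")       # NFC on its own
        s2 = unicodedata.normalize("NFC", "\u0301")  # lone combining acute;
                                                     # also NFC on its own
        joined = s1 + s2

        # The concatenation of two NFC strings need not be NFC:
        # "a" + U+0301 composes to the single code point U+00E1.
        print(unicodedata.normalize("NFC", joined) == joined)          # False
        print(len(joined), len(unicodedata.normalize("NFC", joined)))  # 2 1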

    > It would impose complexity in cases where it's not needed. Most of the
    > time you don't care which code points are combining and which are not,
    > for example when you compose a text file from many pieces (constants
    > and parts filled by users) or when parsing (if a string is specified
    > as ending with a double quote, then programs will in general treat a
    > double quote followed by a combining character as an end marker).

    If you do so with a language that includes <, you violate the Unicode
    standard, because ≮ (that is, < followed by combining U+0338, not a bare <)
    and ≮ are canonically equivalent. You've either got to decompose first or
    look at whole characters instead of individual code points.
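
    Here is the < case in a few lines of Python (again assuming the standard
    unicodedata module); two canonically equivalent texts give a naive
    code-point scanner different answers:

        import unicodedata

        # U+226E NOT LESS-THAN decomposes canonically to "<" + U+0338.
        assert unicodedata.normalize("NFD", "\u226e") == "<\u0338"

        text_nfc = "x \u226e y"
        text_nfd = unicodedata.normalize("NFD", text_nfc)

        # A scanner looking for "<" code point by code point "finds" one
        # inside the decomposed form but not the composed one.
        print("<" in text_nfc)   # False
        print("<" in text_nfd)   # True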

    Has anyone considered this while defining a language? How about the official
    standards bodies? Searching for XML in the archives is a bit unhelpful, and
    UTR #20 doesn't mention the issue. Your solution is just fine if you're
    considering the issue at the bit level, but it strikes me as the wrong answer,
    and I would think it would be surprising to a user who didn't understand
    Unicode, especially in the ≮ case. A warning either way would be nice.

    I'll see if I have time after finals to pound out a basic API that implements
    this, in Ada or Lisp or something. It's not going to be the most efficient thing,
    but I doubt it will make a big difference for most programs, and if you want
    C, you know where to find it.
