Re: Nicest UTF

From: D. Starner (shalesller@writeme.com)
Date: Wed Dec 08 2004 - 18:10:58 CST

Next message: Patrick Andries: "Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again."

Previous message: Philippe Verdy: "Re: IUC27 Unicode, Cultural Diversity, and Multilingual Computing / Africa is forgotten once again."
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

"Marcin 'Qrczak' Kowalczyk" writes:
> String equality in a programming language should not treat composed
> and decomposed forms as equal. Not this level of abstraction.

This implies that every programmer needs an in-depth knowledge of Unicode
to handle simple strings. The concept makes me want to replace Unicode;
spending the rest of my life explaining to programmers, and to the people
who use their programs, why a search for "Römische Elegien" isn't bringing
up the book is not my idea of happiness.
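
To make the search problem concrete, here is a minimal Python sketch. The
helper name canonically_equal and the choice of NFC (rather than NFD) are
assumptions made for illustration; nothing in this thread prescribes them.
It compares a title stored with a precomposed ö against the same title
stored as o plus a combining diaeresis.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "R\u00f6mische Elegien"     # ö as the single code point U+00F6
decomposed = "Ro\u0308mische Elegien"  # o followed by U+0308 COMBINING DIAERESIS

# Code-point equality treats the two spellings as different strings.
print(composed == decomposed)                   # False
# Equality after normalization treats them as the same text.
print(canonically_equal(composed, decomposed))  # True
```

Both strings display identically, so a user typing one form and searching
data stored in the other has no way to see why the plain comparison fails.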

> IMHO splitting into graphemes is the job of a rendering engine, not of
> a function which extracts a part of a string which matches a regex.

So S should _sometimes_ match an accented S? Again, I feel the extended
misery of explaining to people why things aren't working right coming on.
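
The "sometimes" can be shown with Python's standard re module, which
matches code points. The helper find_unaccented below is a hypothetical
name, and its test for "followed by a combining mark" uses a non-zero
canonical combining class, which is a simplification and not a full
grapheme-cluster rule.

```python
import re
import unicodedata

composed = "\u015a"     # Ś as a single code point
decomposed = "S\u0301"  # S followed by U+0301 COMBINING ACUTE ACCENT

# A code-point regex finds "S" only in the decomposed spelling.
print(re.search("S", composed) is not None)    # False
print(re.search("S", decomposed) is not None)  # True

def find_unaccented(text, letter):
    """Return indices where `letter` is not followed by a combining mark.

    Simplification: "combining mark" here means a non-zero canonical
    combining class, which does not cover every mark.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch != letter:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and unicodedata.combining(nxt):
            continue  # the letter carries an accent, so skip it
        hits.append(i)
    return hits

print(find_unaccented("S\u0301 S", "S"))  # [3]
```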

> They are supposed to be equivalent when they are actual characters.
> What if they are numeric character references? Should "≮"
> (7 characters) represent a valid plain-text character or be a broken
> opening tag?

Which 7 characters? My email "client" turned them into the actual characters.
But I think it's fairly obvious that XML added entities in part so you
could include '<'s and other characters without them getting interpreted as
part of the text of the document. Similarly, a combining character entity
following an actual < should be the start of a tag.

> Note that if it's a valid plain-text character, it's impossible
> to represent isolated combining code points in XML,

No more than it's impossible to represent '<' in the text.
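
A small sketch of both points, using Python's xml.etree.ElementTree. The
element name t and the decimal reference &#824; (U+0338 COMBINING LONG
SOLIDUS OVERLAY) are chosen for illustration only, and parses is a
hypothetical helper. An escaped '<' followed by the reference is ordinary
text, while a literal '<' followed by the reference is rejected by the
parser as markup that is not well-formed.

```python
import xml.etree.ElementTree as ET

# Escaped '<' plus a numeric reference to U+0338: both end up as text.
escaped = ET.fromstring("<t>&lt;&#824;</t>")
print([hex(ord(c)) for c in escaped.text])  # ['0x3c', '0x338']

# The combining code point can also follow any other character as text.
after_letter = ET.fromstring("<t>a&#824;</t>")
print(len(after_letter.text))  # 2

def parses(doc):
    """Return True if `doc` is accepted by the parser (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# A literal '<' followed by the reference is treated as a broken tag.
print(parses("<t><&#824;</t>"))  # False
```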

> I expect breakage of XML-based protocols if implementations are
> actually changed to conform to these rules (I bet they don't now).

Really? In what cases are you storing isolated combining code points
in XML as text? I can think of hypothetical cases, but most real-world
use isn't going to be affected. If I were designing such an XML protocol,
I'd probably store it as a decimal number anyway; XML is designed to
be human-readable, and an isolated combining character that, when
displayed, randomly combines with characters it's not logically
associated with isn't particularly human-readable.

> Implementing an API which works in terms of graphemes over an API
> which works in terms of code points is more sane than the converse,
> which suggests that the core API should use code points if both APIs
> are sometimes needed at all.

Implementing an API which works in terms of lists over an API which works
in terms of pointers is more sane than the converse, which suggests that the
core API should use pointers if both APIs are sometimes needed at all.
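
For readers who want to see the layering being argued about, here is a
rough Python sketch of a grapheme-style view built on top of a code-point
string. The function name clusters is hypothetical, and the rule it uses
(attach every code point whose general category starts with "M" to the
preceding one) is only a crude approximation of real grapheme cluster
boundaries.

```python
import unicodedata

def clusters(text):
    """Group each code point with the combining marks that follow it.

    Crude approximation: any code point whose general category starts
    with "M" (Mn, Mc, Me) is attached to the previous cluster.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

word = "Ro\u0308m"          # R, o, U+0308 COMBINING DIAERESIS, m
print(len(word))            # 4 code points
print(len(clusters(word)))  # 3 clusters
```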

> While I'm not obsessed with efficiency, it would be nice if changing
> the API would not slow down string processing too much.

Who knows how much it would slow down string processing? If I get around
to writing the test code, I'll try and see how much it slows stuff down,
but right now we don't know.
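
One possible shape for such test code, as a sketch only: it times plain
equality against equality after NFC normalization using Python's timeit.
The names nfc_equal and measure, the sample strings, and the repeat count
are all assumptions, and no particular result is implied; the numbers
depend entirely on the machine and the data.

```python
import timeit
import unicodedata

def nfc_equal(a, b):
    """Equality after normalizing both operands to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def measure(a, b, repeats=10_000):
    """Return (seconds for plain ==, seconds for nfc_equal) over `repeats` runs."""
    raw = timeit.timeit(lambda: a == b, number=repeats)
    normalized = timeit.timeit(lambda: nfc_equal(a, b), number=repeats)
    return raw, normalized

raw, normalized = measure("R\u00f6mische Elegien", "Ro\u0308mische Elegien")
print(f"plain ==: {raw:.6f}s, normalized ==: {normalized:.6f}s")
```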



This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 18:15:04 CST