string vs. char [was Re: Java and Unicode]

From: addison@inter-locale.com
Date: Thu Nov 16 2000 - 16:31:52 EST

Next message: Houman Pournasseh: "RE: Persian decimal separator"
Previous message: addison@inter-locale.com: "Okay, I didn't mean to send that... (fwd)"
Next in thread: Mark Davis: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Mark Davis: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Marco Cimarosti: "RE: string vs. char [was Re: Java and Unicode]"
Maybe reply: Marco Cimarosti: "RE: string vs. char [was Re: Java and Unicode]"
Maybe reply: Antoine Leca: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Marco Cimarosti: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Michael (michka) Kaplan: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: David Starner: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Michael (michka) Kaplan: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: John Cowan: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: addison@inter-locale.com: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Mark Davis: "Re: string vs. char [was Re: Java and Unicode]"
Maybe reply: Keld Jørn Simonsen: "Re: string vs. char [was Re: Java and Unicode]"

Normally this thread would be of only academic interest to me...

...but this week I'm writing a spec for adding Unicode support to an
embedded operating system written in C. Due to Messrs. O'Conner and
Scherer's presentations at the most recent IUC, I was aware of the clash
between internal string representations and the Unicode Scalar Value
necessary for efficient lookup.

Now I'm getting alarmed about the solution I've selected.

The OS I'm working on is written in C. I considered, therefore, using
UTF-8 as the internal Unicode representation (because I don't have the
option of #defining Unicode and using wchar_t), but the storage expansion
and the fact that several existing modules grok UTF-16 (well, UCS-2), led
me to go in the direction of UTF-16.
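
To put rough numbers on the storage question, here is a small sketch that
counts the bytes one Unicode scalar value occupies in each encoding form.
The function names utf8_bytes and utf16_bytes are made up for illustration
and are not part of any existing module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one Unicode scalar value in UTF-8.
 * (Hypothetical helper; surrogate code points are not rejected here.) */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);        /* outside this range is not Unicode */
    if (cp < 0x80)    return 1;    /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;    /* rest of the BMP */
    return 4;                      /* supplementary planes */
}

/* Bytes needed to store the same scalar value in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (cp < 0x10000) ? 2 : 4; /* one code unit, or a surrogate pair */
}
```

For U+4E2D, for example, this gives three bytes in UTF-8 against two in
UTF-16, while plain ASCII text goes the other way (one byte against two).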

I also considered supporting only UCS-2. It's a bad bad bad idea, but it
gets me out of the following:

I ended up deciding that the Unicode API for this OS will only work in
strings. CTYPE replacement functions (such as isalpha) and character based
replacement functions (such as strchr) will take and return strings for
all of their arguments.
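
As a rough illustration of what "take and return strings" could look like
for a strchr replacement, here is a minimal sketch. The names uchar16 and
u16_strchr are hypothetical, not the actual API of this OS, and the sketch
assumes zero-terminated UTF-16 strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t uchar16;   /* hypothetical name for a UTF-16 code unit */

/* Find the first occurrence in s of the character that starts at ch.
 * Both arguments are zero-terminated UTF-16 strings; only the first
 * character of ch (one code unit, or a surrogate pair) is used.
 * Returns a pointer into s, or NULL if the character is not found. */
const uchar16 *u16_strchr(const uchar16 *s, const uchar16 *ch)
{
    size_t n = 1;

    assert(s != NULL && ch != NULL);

    if (ch[0] == 0) {               /* like strchr(s, '\0') */
        while (*s != 0)
            s++;
        return s;
    }

    /* A lead surrogate followed by a trail surrogate is one character. */
    if (ch[0] >= 0xD800 && ch[0] <= 0xDBFF &&
        ch[1] >= 0xDC00 && ch[1] <= 0xDFFF)
        n = 2;

    for (; *s != 0; s++) {
        /* s[1] is safe to read: s[0] is not the terminator. */
        if (s[0] == ch[0] && (n == 1 || s[1] == ch[1]))
            return s;
    }
    return NULL;
}
```

Because surrogate code units live in their own range, a BMP character can
never match half of a pair by accident. An unpaired lead surrogate passed
as ch would still match the lead half of a pair in s, which the sketch
does not guard against.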

Internally, my functions convert the pointed-to character to its
scalar value (to look it up in the database most efficiently).
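
The conversion itself is small. A minimal sketch, assuming zero-terminated
UTF-16 input; the name u16_scalar_at and the out-parameter are hypothetical
and only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the character that starts at s and return its code point.
 * *units receives the number of 16-bit code units consumed (1 or 2).
 * An unpaired surrogate is returned unchanged as a single unit; a real
 * implementation would need a policy for that case. */
uint32_t u16_scalar_at(const uint16_t *s, size_t *units)
{
    uint16_t lead;

    assert(s != NULL && units != NULL);
    lead = s[0];

    if (lead >= 0xD800 && lead <= 0xDBFF) {
        /* s[1] is safe to read: s[0] is not the terminator. */
        uint16_t trail = s[1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            *units = 2;
            return 0x10000u
                 + (((uint32_t)lead - 0xD800u) << 10)
                 + ((uint32_t)trail - 0xDC00u);
        }
    }

    *units = 1;
    return lead;
}
```

A caller iterating over a string would advance its pointer by *units after
each call and use the returned value as the key for the property lookup.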

This isn't very satisfying. It goes somewhat against the grain of 'C'
programming. But it's equally unsatisfying to use a 32-bit representation
for a character and a 16-bit representation for a string, because in 'C',
a string *is* an array of characters. Which is more
natural? Which is more common? Iterating across an array of 16-bit values
or

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:15 EDT