From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri Dec 10 2004 - 10:18:22 CST
Use of the Unicode standard does *not* require constant validation of
strings. The standard carefully distinguishes between Unicode strings
(D29a-d, page 74) and UTFs. The Unicode strings are in-memory
representations of Unicode, but do not have to be valid UTFs; so every valid
UTF-X string is a Unicode X-bit string, but not the converse. When
interpreting a UTF you must validate the input. And when generating a UTF
you must produce valid results (for that UTF format). But you don't have to
do this with Unicode strings.
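To make the distinction concrete, here is a minimal Python sketch. The name
is_valid_utf16 and the choice to represent a string as a list of integer code
units are assumptions made for illustration only. Any sequence of 16-bit code
units counts as a Unicode 16-bit string; it is valid UTF-16 only if every
surrogate is properly paired.

```python
def is_valid_utf16(units):
    """Return True if the 16-bit code units form well-formed UTF-16.

    `units` is a hypothetical representation: a list of ints in 0..0xFFFF.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: must be immediately followed by a trail surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate that was not preceded by a lead surrogate.
            return False
        i += 1
    return True


paired = [0x0041, 0xD835, 0xDC00]  # "A" followed by U+1D400 as a surrogate pair
loose = [0x0041, 0xD835, 0x0042]   # lead surrogate with no trail after it
```

Both lists are Unicode 16-bit strings, but only the first would pass this
check as UTF-16.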
So there are two strategies for handling Unicode in API libraries:
1. Your internal string format is guaranteed to be UTF. So all operations
have to maintain that. Thus in each API you validate all input and output
strings. (Of course, if an operation is guaranteed to maintain validity of a
UTF-X string, like concatenation, you don't need to recheck the output.)
2. Your internal string format is a Unicode string (but not necessarily a
UTF). In that case, the APIs have to "tolerate" odd code units, and may
produce strings that contain odd code units. (But if all the inputs are UTF
strings, a higher-level API should not produce non-UTF strings.) Note that
whenever you export to a protocol requiring UTF (e.g. saving to a file), you
*do* have to validate, either stripping the odd code units or providing some
other error handling (see UTS #22 for more info).
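As a rough sketch of the export step in strategy #2, the following Python
function turns a tolerant 16-bit string into well-formed UTF-16 at the
boundary. The function name to_utf16, the on_error parameter and its three
values, and the use of U+FFFD as the replacement are all assumptions for
illustration; a real library would choose its own error-handling policy.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def to_utf16(units, on_error="replace"):
    """Return well-formed UTF-16 code units from a 16-bit Unicode string.

    on_error (hypothetical parameter):
      "replace" - substitute U+FFFD for each loose surrogate
      "strip"   - drop each loose surrogate
      "strict"  - raise ValueError on the first loose surrogate
    """
    if on_error not in ("replace", "strip", "strict"):
        raise ValueError("unknown on_error policy: %r" % (on_error,))
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Properly paired surrogates pass through unchanged.
            out.extend((u, units[i + 1]))
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF:
            # Loose surrogate: apply the chosen policy.
            if on_error == "strict":
                raise ValueError("loose surrogate at index %d" % i)
            if on_error == "replace":
                out.append(REPLACEMENT)
            # "strip": append nothing
        else:
            out.append(u)
        i += 1
    return out
```

Internal operations can then stay tolerant, and only the code that writes to
a file or a protocol pays for the check.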
Both of these strategies are legitimate. For 16-bit Unicode, I don't know of
any significant package in practice that does #1; it is just too expensive
and cumbersome, both to check and to handle any exceptions that arise. And
the low odds of encountering a loose surrogate, and the ease of tolerating
it (just treat it like an unassigned code point) make #2 very reasonable. I'm
not as familiar with packages using 8-bit Unicode, but certainly the base C
routines make no such guarantees, so I doubt it would be viable to try to
follow strategy #1 in practice, at least in C.
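For illustration, tolerating a loose surrogate can be as simple as the
following Python sketch, which walks 16-bit code units and yields code points.
The name code_points and the list-of-integers representation are assumptions,
not taken from any particular package. A paired surrogate becomes one
supplementary code point; a loose one is passed through as its own code point,
which later property lookups can treat like an unassigned one.

```python
def code_points(units):
    """Yield code points from 16-bit code units, tolerating loose surrogates."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Well-formed pair: combine into a supplementary code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit or loose surrogate: pass through unchanged.
            yield u
            i += 1
```

No exception path is needed here, which is much of what makes this approach
cheap.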
Mark
----- Original Message -----
From: "Arcane Jill" <arcanejill@ramonsky.com>
To: "Unicode" <unicode@unicode.org>
Sent: Friday, December 10, 2004 06:46
Subject: When to validate?
> Here's something that's been bothering me. Suppose I write a function -
> let's call it trim(), which removes leading and trailing spaces from a
> string, represented as one of the UTFs. If I've understood this correctly,
> I'm supposed to validate the input, yes?
>
> Okay, now suppose I write a second function - let's call it tolower(),
> which lowercases a string, again represented as one of the UTFs. Again, I
> guess I'm supposed to validate the input, yes?
>
> And yet, in an expression such as tolower(trim(s)), the second validation
> is unnecessary. The input to tolower() /must/ be valid, because it is the
> output of trim(). But on the other hand, tolower() could be called with
> arbitrary input, so I can't skip the validation.
>
> For efficiency, I /could/ assume that all input was already valid - but
> then, what if it isn't? Or I could validate all input - but that's
> inefficient. Or I could write two versions of each function, one
> validating, the other not, but that adds too much complexity. It seems to
> me that not validating input to such functions would give you the best
> performance, but then in order to remain compliant you'd have to do the
> validation somewhere else - for example something like
>
> t = tolower(trim(validate(s))).
>
> where validate(s) does nothing but throw an exception if s is invalid.
>
> Other people must have had to make decisions like this. What's the
> preferred strategy?
> Arcane Jill
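The validate-once pattern from the quoted question might look roughly like
this in Python. The three function names follow the question; their bodies
are assumptions for illustration. Python's str is a sequence of code points
and can hold surrogate code points, so validate() here simply rejects any
surrogate, while trim() and tolower() do no checking of their own.

```python
def validate(s):
    """Return s unchanged, or raise ValueError if it contains a surrogate."""
    for index, ch in enumerate(s):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError("surrogate code point at index %d" % index)
    return s


def trim(s):
    """Remove leading and trailing spaces; performs no validation."""
    return s.strip(" ")


def tolower(s):
    """Lowercase the string; performs no validation."""
    return s.lower()


# Validate once at the boundary, then chain the non-validating operations.
t = tolower(trim(validate("  Hello World  ")))
```

The cost of checking is paid once per input rather than once per call.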