Re: UTF-8 Corrigendum, new Glossary

From: Doug Ewell
Date: Thu Nov 30 2000 - 11:48:54 EST

"G. Adam Stanislav" wrote:

>> 1. The Unicode Technical Committee has modified the definition of
>> UTF-8 to forbid conformant implementations from interpreting non-
>> shortest forms for BMP characters,
> I find this silly. That creation of such forms would be forbidden I
> can see and agree to. But interpretation? I understand the reasoning
> when security is an issue. But why make it flat illegal? There are
> many applications where such a sequence poses no security danger.

I used to be concerned about that. I think I cited the example of an
encyclopedia on CD-ROM with text in UTF-8. Obviously this text is all
internal and almost certainly valid, and there are no security holes
involved, so the UTF-8 decoder can take certain shortcuts.
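
To make the kind of shortcut concrete, here is a minimal sketch in Python. The function name unchecked_decode_2byte is hypothetical and invented for illustration. It decodes a two-byte sequence by bit arithmetic alone, without checking that the result needed two bytes, so it accepts the non-shortest form C0 AF as U+002F ('/'). Python's built-in strict decoder stands in for a checking decoder and rejects the same bytes.

```python
def unchecked_decode_2byte(lead: int, trail: int) -> int:
    """Hypothetical shortcut: combine the payload bits, no range checks."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# Non-shortest (overlong) two-byte form of U+002F '/'.
# The shortest form is the single byte 0x2F.
overlong_slash = bytes([0xC0, 0xAF])

# The shortcut decoder accepts the sequence and yields U+002F.
cp = unchecked_decode_2byte(overlong_slash[0], overlong_slash[1])

# A checking decoder refuses to interpret the same bytes.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```

On trusted internal text this difference never shows up, because such sequences are never generated. It matters only when the bytes come from somewhere that has not been checked.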

But this is now covered in the corrigendum:

> Internally, a particular function might be used that does not check
> for illegal code unit sequences. However, a conformant process can
> use that function _only_ on data that has already been certified to
> not contain any illegal code unit sequences.
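
One way to read that paragraph, sketched in Python under stated assumptions: validate the data once, mark it as checked, and let the unchecked function accept only marked data. The names CertifiedUtf8, certify, and fast_decode are hypothetical and do not come from the corrigendum. Python's built-in strict decoder is used here as the checking step, and the marker type is enforced only by a static type checker, not at run time.

```python
from typing import NewType

# Hypothetical marker type: bytes that have already passed validation.
CertifiedUtf8 = NewType("CertifiedUtf8", bytes)


def certify(data: bytes) -> CertifiedUtf8:
    """Validate once; raises UnicodeDecodeError on illegal sequences."""
    data.decode("utf-8")  # strict by default, rejects non-shortest forms
    return CertifiedUtf8(data)


def fast_decode(data: CertifiedUtf8) -> str:
    """Unchecked decoder: assumes the input was certified beforehand."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:      # one-byte sequence
            cp, size = b, 1
        elif b < 0xE0:    # two-byte sequence
            cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            size = 2
        elif b < 0xF0:    # three-byte sequence
            cp = (((b & 0x0F) << 12)
                  | ((data[i + 1] & 0x3F) << 6)
                  | (data[i + 2] & 0x3F))
            size = 3
        else:             # four-byte sequence
            cp = (((b & 0x07) << 18)
                  | ((data[i + 1] & 0x3F) << 12)
                  | ((data[i + 2] & 0x3F) << 6)
                  | (data[i + 3] & 0x3F))
            size = 4
        out.append(chr(cp))
        i += size
    return "".join(out)


# Usage: certify at the boundary (for example, when the CD-ROM is
# mastered), then use the fast path freely on the certified bytes.
text = "Encyclop\u00e6dia \u767e\u79d1\u4e8b\u5178 \U0001D11E"
decoded = fast_decode(certify(text.encode("utf-8")))
```

In this reading, whoever runs the validation step is the one doing the certifying.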

The word "certified" did make me chuckle, though. Who would do the
certifying? Katherine Harris?

-Doug Ewell
 Fullerton, California
