Re: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 18:01:17 EST

Next message: Philippe Verdy: "Re: UTF-16 inside UTF-8"

Previous message: Philippe Verdy: "Re: [hebrew] Re: Hebrew composition model, with cantillation marks"
In reply to: Doug Ewell: "Re: UTF-16 inside UTF-8"
Next in thread: Philippe Verdy: "Re: UTF-16 inside UTF-8"

From: "Doug Ewell" <dewell@adelphia.net>

> I don't know about the relative market needs. I think supplementary
> character support is important because these characters are part of
> Unicode just as much as BMP characters are, and implementing UTF-8
> support for the entire Unicode code space is about 0.1% harder than
> artificially crippling it by restricting it to the BMP.

For the UTF-8 encoding scheme only, used in interchanged data, I agree.

But I disagree with you in these areas which are basic to any decent SQL
engine:
    - the collation order: needed for binary comparison operators,
'between...and', and sorts with 'order by';
    - char/varchar datatype length integrity constraints, and the length()
function in expressions (note that integrity constraints on 'char(n)' may
also imply physical constraints on the storage: how do you allocate storage
space?);
    - substring operations: how do you define the position indices in
strings: counted in terms of UTF-16 code units or in terms of code points?
    - regular expressions: either the limited ANSI SQL 'like' operator with
'%' and '_', or more complex regular expressions with [allowed characters],
[^disallowed characters], | alternatives, {repetitions}, +, *, the ? option
and (grouping): they also need to treat pairs of surrogates as a single
code point;
    - capability of external storage table formats;
    - negotiation of allowed charsets and their encodings with clients.
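
To make the first four points concrete, here is a minimal sketch in Python
(not the code of any SQL engine). The helper names utf16_units, utf16_key and
substring_utf16 are hypothetical, and the sample characters are chosen only
for illustration. It assumes an engine that stores strings as UTF-16 code
units and compares them with what a code-point-based engine would do.

```python
import re

def utf16_units(s: str) -> int:
    """Length of s counted in UTF-16 code units (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2

def utf16_key(s: str) -> bytes:
    """Sort key reproducing binary order over UTF-16 code units.

    Big-endian bytes compare the same way as the 16-bit units themselves.
    """
    return s.encode("utf-16-be")

def substring_utf16(s: str, start: int, length: int) -> str:
    """Substring with positions counted in UTF-16 code units.

    A cut in the middle of a surrogate pair leaves a lone surrogate,
    which is decoded here as a replacement character.
    """
    data = s.encode("utf-16-le")
    chunk = data[2 * start : 2 * (start + length)]
    return chunk.decode("utf-16-le", errors="replace")

# One supplementary character (U+10400) between two ASCII letters.
text = "a\U00010400b"

# length(): the two counting conventions disagree.
code_points = len(text)          # 3 code points
code_units = utf16_units(text)   # 4 UTF-16 code units

# substring: the same indices select different text.
by_points = text[0:2]                    # 'a' plus the whole U+10400
by_units = substring_utf16(text, 0, 2)   # 'a' plus half a surrogate pair

# Collation: binary order differs between the two representations.
bmp_char = "\uFF5E"        # BMP character above the surrogate range
supp_char = "\U00010000"   # first supplementary character
order_by_code_point = sorted([bmp_char, supp_char])
order_by_utf16_unit = sorted([bmp_char, supp_char], key=utf16_key)

# Pattern matching: Python's re works on code points, so '.' consumes
# the surrogate pair as one character.
matches_as_one_char = re.fullmatch(r"a.b", text) is not None
```

In code point order the BMP character sorts first, while in UTF-16 code unit
order the supplementary character sorts first, because its lead surrogate
(0xD800) is lower than 0xFF5E. An engine has to pick one convention for each
operation and apply it consistently.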

All of these will require more than 0.1% of additional work, if only in the
design phase... And I don't think that the core engine will be updated to
remap the internal datatype used to represent code units from 16-bit to
32-bit for all strings (think about the performance issues, notably for disk
swaps).
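
As a rough illustration of the storage cost, the following Python sketch
compares the byte size of the same strings held as 16-bit and as 32-bit code
units. The function storage_bytes and the sample strings are hypothetical;
this says nothing about how a real engine lays out its pages.

```python
def storage_bytes(s: str, unit_bits: int) -> int:
    """Bytes needed to hold s as fixed-width code units of the given size.

    16 selects UTF-16 (supplementary characters take two units),
    32 selects UTF-32 (every code point takes one unit).
    """
    codec = {16: "utf-16-le", 32: "utf-32-le"}[unit_bits]
    return len(s.encode(codec))

# A BMP-only string: the common case in most existing data.
bmp_only = "order by name"
bmp_16 = storage_bytes(bmp_only, 16)   # 2 bytes per character
bmp_32 = storage_bytes(bmp_only, 32)   # 4 bytes per character

# A string with one supplementary character (U+10400).
mixed = "a\U00010400"
mixed_16 = storage_bytes(mixed, 16)    # 2 + 4 bytes
mixed_32 = storage_bytes(mixed, 32)    # 4 + 4 bytes
```

For BMP-only text, moving to 32-bit units doubles the size of every string in
memory and on disk, which is the cost alluded to above.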

This archive was generated by hypermail 2.1.5 : Wed Nov 05 2003 - 19:02:25 EST