Re: UTF-16 inside UTF-8

From: YTang0648@aol.com
Date: Wed Nov 05 2003 - 20:08:53 EST

    In a message dated 11/5/2003 3:18:51 PM Pacific Standard Time,
    verdy_p@wanadoo.fr writes:
    From: "Doug Ewell" <dewell@adelphia.net>

    > I don't know about the relative market needs. I think supplementary
    > character support is important because these characters are part of
    > Unicode just as much as BMP characters are, and implementing UTF-8
    > support for the entire Unicode code space is about 0.1% harder than
    > artificially crippling it by restricting it to the BMP.

    For the UTF-8 encoding scheme only, used in interchanged data, I agree.
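
To illustrate why the encoding scheme itself is the easy part, here is a
minimal hand-rolled sketch of a UTF-8 encoder for a single code point. The
function name encode_utf8 is made up for this example and is not taken from
any library or from the messages quoted here. Under that assumption, support
for the supplementary planes amounts to the last branch only.

```python
def encode_utf8(cp):
    """Encode one Unicode scalar value as UTF-8 (illustrative sketch only)."""
    # Reject values outside the code space and the surrogate range,
    # which UTF-8 must not encode.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:
        # 1 byte: U+0000..U+007F
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: U+0080..U+07FF
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: rest of the BMP
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: supplementary planes, U+10000..U+10FFFF.
    # This is the only branch a BMP-only encoder would be missing.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


if __name__ == "__main__":
    # Compare against Python's built-in codec for a supplementary character.
    print(encode_utf8(0x10400) == "\U00010400".encode("utf-8"))
```

The sketch covers encoding only; a decoder would additionally need to reject
overlong and truncated sequences.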

    But I disagree with you in these areas which are basic to any decent SQL
    engine:
        - the collation order: needed for binary comparison operators,
    'between...and', and sorts with 'order by';
        - char/varchar datatype length integrity constraints, and the length()
    function in expressions (note that integrity constraints on 'char(n)' may
    also imply physical constraints on the storage: how do you allocate storage
    space?)
        - substring operations: how do you define the position indices in
    strings: counted in terms of UTF-16 code units or in terms of code points?
        - regular expressions: either the limited ANSI SQL 'like' operator with
    '%' and '_', or more complex regular expressions with [allowed characters],
    [^disallowed characters], | alternatives, {repetitions}, +, *, optional ?, and
    (grouping): these also need to treat a surrogate pair as a single
    code point;
        - capability of external storage table formats;
        - negotiation of allowed charsets and their encodings with clients;
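
The length, substring, and collation items in the list above can be made
concrete with a small sketch. It assumes a hypothetical engine that stores
strings as UTF-16 code units; the helper names utf16_units and
is_high_surrogate are invented for this example. Python's own str type
indexes by code point, so the two views can be compared side by side.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    """True if unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


# 'a', U+10400 (a supplementary character), 'b'
s = "a\U00010400b"
units = utf16_units(s)

# length(): the answer depends on what is being counted.
length_in_code_points = len(s)
length_in_code_units = len(units)

# substring: taking the first two code units cuts the surrogate pair in half,
# leaving a lone high surrogate at the end of the result.
prefix_units = units[:2]
prefix_is_broken = is_high_surrogate(prefix_units[-1])

# collation: binary order by code point and by UTF-16 code unit disagree
# for a BMP character above the surrogate range versus a supplementary one.
bmp_char = "\uFF5E"
supp_char = "\U00010400"
order_by_code_point = sorted([bmp_char, supp_char])
order_by_code_unit = sorted([bmp_char, supp_char], key=utf16_units)

if __name__ == "__main__":
    print(length_in_code_points, length_in_code_units)
    print(prefix_is_broken)
    print(order_by_code_point == order_by_code_unit)
```

The same reasoning applies to a single-character wildcard such as '_' in
'like': it has to consume both code units of a pair to match one character.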

    All of these will require more than 0.1% of additional work, even if only
    in the design phase. And I don't think that the core engine will be updated
    to remap the internal datatype used to represent code units from 16-bit to
    32-bit for all strings (think about performance issues, notably for disk
    swaps).
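
The storage concern behind the 16-bit versus 32-bit remark can be sketched
with Python's built-in codecs. The function name storage_bytes and the sample
strings are assumptions made for this example; real engines add their own
per-string overhead, which is ignored here.

```python
def storage_bytes(text):
    """Return the encoded size of text in bytes for each encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),   # no byte order mark
        "utf-32": len(text.encode("utf-32-le")),   # no byte order mark
    }


# BMP-only text: every character is one 16-bit unit or one 32-bit unit,
# so a 32-bit internal form doubles the size of the 16-bit one.
bmp_sizes = storage_bytes("SELECT name FROM users")

# A single supplementary character: two 16-bit units (a surrogate pair)
# or one 32-bit unit, so both forms take the same space.
supp_sizes = storage_bytes("\U00010400")

if __name__ == "__main__":
    print(bmp_sizes)
    print(supp_sizes)
```

For data that is mostly in the BMP, a move to 32-bit units therefore costs
roughly twice the memory and disk traffic, which matches the performance
worry stated above.
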

    I agree with everything Verdy said, except
    " - negotiation of allowed charsets and their encodings with clients;"
    I don't think this will be affected by whether surrogate support is added
    or not.

    --
    Frank Yung-Fong Tang
    System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes 
    AIM:yungfongta mailto:ytang0648@aol.com Tel:650-937-2913 
    Yahoo! Msg: frankyungfongtan
    

