Re: Other Question, Problem, or Feedback

From: Richard Wordingham
Date: Mon Jun 12 2006 - 21:15:27 CDT

    ----- Original Message -----
    From: "Magda Danish (Unicode)" <>
    To: <>
    Cc: <>
    Sent: Monday, June 12, 2006 8:04 PM
    Subject: FW: Other Question, Problem, or Feedback

    > -----Original Message-----
    > Date/Time: Sat Jun 10 14:54:43 CDT 2006
    > Contact:
    > Name:
    > Report Type: UTF-16 & UTF-32
    > I haven't been able to find an answer in the FAQ or googling the site to
    > these questions...
    > 1. Is it true that there are many ways of encoding the same character in
    > UTF-16?

    No. There is exactly one way of encoding each character in UTF-16. See TUS
    4.0 Section 2.5 'Encoding Forms', especially p29.
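
To illustrate the point, here is a minimal Python sketch of the UTF-16 mapping as I understand it from the encoding-form definitions. The function name `utf16_code_units` is made up for this example and is not taken from the Standard or from any library. Each code point goes down exactly one branch, so there is only one possible result.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units for a single code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit code unit, the value itself.
        return [code_point]
    # Supplementary planes: a surrogate pair carrying 10 bits each.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# U+0041 LATIN CAPITAL LETTER A and U+1D11E MUSICAL SYMBOL G CLEF
print([hex(u) for u in utf16_code_units(0x41)])
print([hex(u) for u in utf16_code_units(0x1D11E)])
```

The results can be cross-checked against Python's built-in `utf-16-be` codec, which should produce the same code units in big-endian byte order.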

    > Do you know if common regular expression search functions like those of
    > .NET or Perl will find a character regardless of in what fashion it was
    > encoded?

    This problem therefore does not arise.

    > 2. Why is there now UTF-32?

    Binarism. A 27-bit word is perfectly capable of representing any valid
    codepoint. Anything that can be validly done with UTF-32 can be done with
    any word size from 21 bits upwards. (Anyone contemplating using a
    non-binary representation should consult the final part of TUS 4.0 Section
    2.4 for the implications on Unicode data tables :-).
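
As a rough sketch of the word-size claim, the Python below stores code points in fixed-width fields of an arbitrary width and reads them back. The names `pack_code_points` and `unpack_code_points` are invented for this example, and a single Python integer stands in for a memory of N-bit words. Any width of 21 bits or more round-trips; 32 is simply the convenient binary choice.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point; needs 21 bits


def pack_code_points(code_points, word_bits):
    """Store each code point in its own word_bits-wide field (hypothetical helper)."""
    if word_bits < MAX_CODE_POINT.bit_length():
        raise ValueError("word too narrow to hold every code point")
    packed = 0
    for cp in code_points:
        if not 0 <= cp <= MAX_CODE_POINT:
            raise ValueError("not a code point")
        packed = (packed << word_bits) | cp
    return packed


def unpack_code_points(packed, count, word_bits):
    """Recover count code points from fields of word_bits bits each."""
    mask = (1 << word_bits) - 1
    return [(packed >> (word_bits * i)) & mask for i in reversed(range(count))]


sample = [0x41, 0x1D11E, 0x10FFFF]
for width in (21, 27, 32):
    assert unpack_code_points(pack_code_points(sample, width), len(sample), width) == sample
```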

    > Are there even that many characters in the world that they need 32-bit
    > representation?

    If everyone invented a character and it were accepted, despite the alleged
    rule on not encoding novel or idiosyncratic characters ('Note, however, that
    the Unicode Standard does not encode idiosyncratic, personal, novel, or
    private-use characters, nor does it encode logos or graphics.' - TUS 4.0
    Section 1.1 Paragraph 3), 32 bits would not be enough. However, it is
    currently strenuously maintained that 21 bits will suffice. The range of
    values is 0 to 0x10FFFF (TUS 4.0 Section 2.4 Paragraph 3).
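
For a sense of scale, here is a small Python sketch of the arithmetic behind that range. The constant and function names are made up for this example. It only counts code points, not assigned characters, which are far fewer.

```python
MAX_CODE_POINT = 0x10FFFF


def is_code_point(value: int) -> bool:
    """True if value lies in the code space 0..0x10FFFF (hypothetical helper)."""
    return 0 <= value <= MAX_CODE_POINT


code_space_size = MAX_CODE_POINT + 1  # 17 planes of 65,536 code points each

# The code space is too big for 20 bits but fits comfortably in 21.
assert 2**20 < code_space_size <= 2**21
assert code_space_size == 17 * 0x10000

print(code_space_size)           # number of code points
print(2**32 // code_space_size)  # how many times over a 32-bit word could cover it
```

Python's own `chr()` enforces the same upper bound and raises `ValueError` for `chr(0x110000)`.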

    This archive was generated by hypermail 2.1.5 : Mon Jun 12 2006 - 21:30:11 CDT