Re: Surrogate pairs and UTF-8

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Wed Jun 21 2006 - 17:18:25 CDT

Next message: Rick McGowan: "IUC 30 Program Announced"

Previous message: Kenneth Whistler: "Re: Surrogate pairs and UTF-8"
In reply to: Pavils Jurjans: "Surrogate pairs and UTF-8"
Next in thread: Mike: "Re: Surrogate pairs and UTF-8"

Pavils Jurjans wrote on Wednesday, June 21, 2006 at 9:05 AM

> However, I am
> completely lacking in-depth information about the surrogate pairs and how
> to
> handle them in UTF-8. So, here are the questions, what I am looking for:
> - I have read the theoretical definition of what a surrogate pair is.
> However, I have never seen any in "life". Can you give an example of some
> surrogate pairs, and how do their respective character look like?

Well, if you really want to look at them, and have the bandwidth, you could
always take a look at the complete set of code charts at
http://www.unicode.org/Public/5.0.0/charts/CodeCharts-5.0.0d3.pdf -
33 MB! Less ambitiously, just browse what's at
http://www.unicode.org/charts/ . The URLs of the blocks tell you the
starting code points. Just as in the BMP, many of the supplementary
characters are CJK characters.

For an example from the smaller files, you have 𐀣 U+10023 LINEAR B SYLLABLE
B016 QA. If I save it as the only character in a file on Windows XP in
UTF-16, I get the byte sequence FF, FE, 00, D8, 23, DC. (The byte sequence
FF, FE tells one it's little-endian.) If I save it as UTF-16 big-endian, I
get FE, FF, D8, 00, DC, 23. Finally, if I save it as UTF-8, I get EF, BB,
BF, F0, 90, 80, A3 - the first three bytes are again the 'byte-order' mark,
which in UTF-8 serves only as a signature, since UTF-8 has no byte order.
I don't know if you could call the last four bytes a 'surrogate quartet' :-)
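
The byte sequences above can be reproduced without a text editor. What follows is only a minimal sketch in Python, chosen for illustration; the helper name hex_bytes is made up here. It computes the surrogate pair for U+10023 by hand and then compares it with what the standard codecs produce.

```python
import codecs

ch = "\U00010023"  # LINEAR B SYLLABLE B016 QA


def hex_bytes(data: bytes) -> str:
    """Format bytes the way they are listed in the message: 'FF, FE, ...'."""
    return ", ".join(f"{b:02X}" for b in data)


# Surrogate pair by hand: subtract 0x10000, then split the 20-bit
# remainder into a high (top 10 bits) and a low (bottom 10 bits) half.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

# The same character in the three serializations, each with its BOM.
utf16_le = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")
utf16_be = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")
utf8_sig = ch.encode("utf-8-sig")  # 'utf-8-sig' prepends EF BB BF

print(f"surrogates: {high:04X} {low:04X}")
print("UTF-16LE:", hex_bytes(utf16_le))
print("UTF-16BE:", hex_bytes(utf16_be))
print("UTF-8:   ", hex_bytes(utf8_sig))
```

Note that the surrogate values D800 and DC23 appear only in the UTF-16 forms; the UTF-8 form encodes the code point U+10023 directly in four bytes.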

Actually, there is a minor issue with converting surrogate code-units to
UTF-8. While surrogate pairs present little problem, my code has failed
various tests because of the way it handled unpaired surrogates. When doing
intermediate manipulation in UTF-8, I convert unpaired surrogates to U+FFFD
REPLACEMENT CHARACTER. I then came unstuck in the collation tests because
unpaired surrogates and U+FFFD collate differently!
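
To make the pitfall concrete, here is a rough sketch in Python of that kind of conversion. It is not my actual code: the function name utf16_units_to_utf8 is hypothetical, and it assumes the input is a list of 16-bit code units. Well-formed pairs are combined into one code point, and any unpaired surrogate is replaced by U+FFFD before encoding.

```python
def utf16_units_to_utf8(units):
    """Convert UTF-16 code units to UTF-8, mapping unpaired surrogates to U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate (high or low): substitute U+FFFD.
            out.append("\ufffd")
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out).encode("utf-8")


paired = utf16_units_to_utf8([0xD800, 0xDC23])      # U+10023
unpaired = utf16_units_to_utf8([0x41, 0xD800, 0x42])  # 'A', lone high, 'B'
genuine = utf16_units_to_utf8([0x41, 0xFFFD, 0x42])   # 'A', real U+FFFD, 'B'

print(paired.hex(), unpaired.hex(), genuine.hex())
```

After the conversion, the input with the unpaired surrogate and the input with a genuine U+FFFD yield identical bytes, so nothing downstream can tell them apart. That loss of distinction is what bites in a test that expects the two to sort differently.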

Richard.

This archive was generated by hypermail 2.1.5 : Wed Jun 21 2006 - 17:51:33 CDT