Re: Strange Browser Behavior

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jan 08 2007 - 11:59:48 CST


Firefox on Windows displays an Armenian letter (or is it a ligature? I can't identify it in any of my installed fonts) followed by a question mark: it can't be FEFF E083 DDBD.

I see lots of binary similarities between 230DBD and FEFF E083 DDBD.
It seems that the out-of-range code point 230DBD is treated like a supplementary code point outside the BMP, so your browser tries to encode it internally as a pair of surrogates. FEFF must just be the leading byte order mark that you see when looking at the source after saving it as UTF-16. You can check this by parsing it with JavaScript:

<html><body>
<div id="x">&#2297277;</div>
<script language="javascript">
var hD="0123456789ABCDEF";
function d2h(d) {var h=hD.substr(d&15,1);while(d>15){d>>=4;h=hD.substr(d&15,1)+h}return h;}
x=document.getElementById("x").innerHTML;
document.write("x.length="+x.length+"<br/>");
for (i=0;i<x.length;i++) document.write("x["+i+"]='"+x.charAt(i)+"' (0x"+d2h(x.charCodeAt(i))+")<br/>");
</script></body></html>

I get x.length=2, x[0]=0xE083, x[1]=0xDDBD. This proves that FEFF is just the byte order mark for the document, not a code unit produced by the strange numeric reference.

Note that JavaScript strings store their characters as UTF-16 code units: for example, if I use &#x10000; instead of your numeric reference, I also get x.length=2, with x[0]=0xD800 and x[1]=0xDC00.
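
The same observation can be reproduced outside a browser. The following is only a minimal sketch: codeUnits is a hypothetical helper name, and String.fromCodePoint is an ES2015 addition that was not available in 2007 browsers, so a modern engine is assumed.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as
// uppercase hex strings, one entry per string index.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16).toUpperCase());
  }
  return units;
}

// Assumption: a modern engine with String.fromCodePoint (ES2015).
// U+10000 is the first code point outside the BMP.
const s = String.fromCodePoint(0x10000);
console.log("length=" + s.length + ", units=" + codeUnits(s).join(" "));
```

With U+10000 this should report a length of 2 and the code units D800 and DC00, matching what the test page shows for &#x10000;.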

So it seems that Firefox really attempts to convert the codepoint to UTF-16 as:

x[0] = (char)(0xD800+((codepoint-0x10000)>>10))
     = (char)(0xD800+((0x230DBD-0x10000)>>10))
     = (char)(0xD800+(0x220DBD>>10))
     = (char)(0xD800+0x883)=(char)0xE083

x[1] = (char)(0xDC00+((codepoint-0x10000)&0x3FF))
     = (char)(0xDC00+((0x230DBD-0x10000)&0x3FF))
     = (char)(0xDC00+0x1BD)=(char)0xDDBD
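
The arithmetic above can be checked mechanically. This is a sketch of the conversion as I presume it happens, not Firefox's actual code: uncheckedToUtf16 and hex are hypothetical names, and the "& 0xFFFF" mask stands in for the (char) cast.

```javascript
// Hypothetical helper: split a code point into two 16-bit units
// WITHOUT checking that it lies in U+10000..U+10FFFF.
// The "& 0xFFFF" mask emulates a cast to a 16-bit char.
function uncheckedToUtf16(codepoint) {
  const offset = codepoint - 0x10000;
  const first = (0xD800 + (offset >> 10)) & 0xFFFF;
  const second = (0xDC00 + (offset & 0x3FF)) & 0xFFFF;
  return [first, second];
}

// Format a number as uppercase hex.
function hex(n) {
  return n.toString(16).toUpperCase();
}

// 2297277 decimal is 0x230DBD, the value of the numeric reference.
console.log(uncheckedToUtf16(2297277).map(hex).join(" "));
// A valid supplementary code point, for comparison.
console.log(uncheckedToUtf16(0x10000).map(hex).join(" "));
```

For 0x230DBD the two units come out as E083 and DDBD, the same values the test page reports; for 0x10000 they are the valid pair D800 DC00.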

In other words, Firefox does not check the range, and generates a first code unit that is not a surrogate. As the second code unit is an isolated low surrogate, it is displayed as a question mark.
The first code unit now falls in the Private Use Area (PUA), so how it is displayed depends on the locally installed fonts.
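
To make the classification concrete, here is a sketch that sorts a code unit into the relevant ranges and shows what a range-checked conversion could look like. Both function names are hypothetical, and substituting U+FFFD for an invalid code point is an assumption about one reasonable policy, not a description of what any browser did at the time.

```javascript
// Hypothetical helper: classify a 16-bit code unit by Unicode range.
function classifyCodeUnit(u) {
  if (u >= 0xD800 && u <= 0xDBFF) return "high surrogate";
  if (u >= 0xDC00 && u <= 0xDFFF) return "low surrogate";
  if (u >= 0xE000 && u <= 0xF8FF) return "private use";
  return "other";
}

// Hypothetical range-checked conversion. Assumption: anything above
// U+10FFFF, below zero, or inside the surrogate range is replaced by
// U+FFFD (REPLACEMENT CHARACTER) instead of being encoded blindly.
function checkedToUtf16(codepoint) {
  const invalid = codepoint < 0 || codepoint > 0x10FFFF ||
                  (codepoint >= 0xD800 && codepoint <= 0xDFFF);
  if (invalid) return [0xFFFD];
  if (codepoint < 0x10000) return [codepoint]; // BMP: a single code unit
  const offset = codepoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

console.log(classifyCodeUnit(0xE083)); // first unit seen in Firefox
console.log(classifyCodeUnit(0xDDBD)); // second unit seen in Firefox
console.log(checkedToUtf16(0x230DBD).map(function (u) {
  return u.toString(16).toUpperCase();
}).join(" "));
```

Here E083 is classified as private use and DDBD as a low surrogate, while the checked conversion rejects 0x230DBD outright rather than producing that pair.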

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Unicode Mailing List" <unicode@unicode.org>
Cc: "Tom Gewecke" <tom@bluesky.org>
Sent: Monday, January 08, 2007 4:21 PM
Subject: Re: Strange Browser Behavior

> IE 7 displays a single ASCII question mark.
>
> 2297277 decimal is 230DBD hex, so I don't yet see where <FEFF, E083,
> DDBD> would come from.


