Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy ([email protected])
Date: Wed Aug 10 2005 - 14:14:30 CDT

Next message: [email protected]: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Previous message: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

From: "Tom Emerson" <[email protected]>
> This happens *all* the time. I constantly encounter pages that are
> labeled as ISO-8859-1 (actually usually CP1252) and indeed, if you
> just look at the byte values, are valid Latin 1 (or even just
> US-ASCII). However, the content is encoded in HTML escapes, and is
> actually Arabic or Persian. Hence you have to do the detection in a
> couple of steps, since the presence of these entities (remember, an
> X?HTML page can include any character regardless of the declared
> "primary" encoding) opens up all of Unicode.

This is absolutely not needed for a charset detector (i.e. the detection of
the encoding used to serialize the text). HTML escapes are perfectly valid
in HTML, and even if they refer to non Latin-1 characters, this does not
change the fact that the page remains encoded in ISO-8859-1.
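As a minimal sketch of that separation (the function name and sample bytes
below are illustrative assumptions, not taken from any particular detector),
the byte-level decoding and the resolution of escapes can be treated as two
independent steps:

```python
import html


def decode_then_unescape(raw: bytes, charset: str) -> str:
    """Hypothetical helper: decode bytes first, resolve HTML escapes second."""
    # Step 1: charset decoding works on bytes only and never looks at escapes.
    markup = raw.decode(charset)
    # Step 2: escape resolution works on characters and is charset-agnostic.
    return html.unescape(markup)


# Pure ASCII bytes whose escapes spell an Arabic/Persian word.
raw = b"<p>&#1587;&#1604;&#1575;&#1605;</p>"

as_latin1 = decode_then_unescape(raw, "iso-8859-1")
as_utf8 = decode_then_unescape(raw, "utf-8")
```

Both calls produce the same text, because the escapes are plain ASCII in any
of these encodings; only the first step depends on the declared charset.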

You don't need to take HTML escapes into account to determine which
encoding is used, because these escapes are independent of the actual
encoding used.

With only one exception: some numeric HTML escapes like "&#128;" or "&#x80;"
are used and normally refer to the first C1 control (U+0080), independently
of the encoding used. So an HTML renderer should render this C1 control, but
it is normally invalid in HTML text, which restricts the subset of Unicode
characters allowed (the only acceptable controls are CR, LF, TAB). Some
browsers like IE ignore this kind of error and instead attempt to substitute
another codepoint acceptable in the HTML subset for the codepoint that is
invalid for HTML.

In this case, IE will typically convert the invalid codepoint as if it were a
code in a Windows codepage, so here it will render the Euro symbol. This
kind of substitution is based on the effective legacy charset used to encode
the page: if the page is encoded with ISO-8859-1 or Windows-1252, IE will
map codepoint 128 to the Euro symbol as defined in Windows-1252. This
sort of autocorrection is quite common, but the page is still not valid
HTML.
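The substitution described above could be sketched roughly as follows. This
is only an illustration of the idea, not IE's actual algorithm (which is not
documented in this message); the function name and the "lenient" flag are
assumptions made for the example:

```python
def resolve_numeric_reference(codepoint: int, lenient: bool = True) -> str:
    """Hypothetical resolver for a numeric escape such as "&#128;"."""
    # Lenient mode: treat a C1 codepoint (0x80-0x9F) as if it were a byte
    # in the Windows-1252 codepage, and use the character that byte encodes.
    if lenient and 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are unassigned in Python's cp1252
            # codec; fall through and keep the C1 control unchanged.
            pass
    # Strict mode, or any other codepoint: take the number at face value.
    return chr(codepoint)


euro = resolve_numeric_reference(128)                   # remapped
control = resolve_numeric_reference(128, lenient=False)  # stays U+0080
```

Python's own html.unescape() applies a comparable remapping for these
references, following the HTML5 rules for invalid character references.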

If the page is encoded with UTF-8 or UTF-16, the reference "&#128;" is not
remapped and remains associated with the C1 control. In that case, the
character will either not be rendered or be rendered as a square box,
depending on the font used; if a non-Unicode font is used, the codepoint
is rendered using the codeposition of the glyph in that legacy font. There
are various tricks used there, but it seems that this is done to preserve
the compatibility with texts using legacy charsets and legacy fonts for
which not all characters are mapped to Unicode. I don't know how IE manages
it internally, but this seems like a renderer-specific issue where
non-Unicode characters can be rendered even though they are normally invalid
with strict HTML. The actual algorithm to render these invalid characters
may be even more complex when you consider the special case of "Symbol"
fonts (with their specific codepositions that are mapped to Unicode with a
constant offset).

This archive was generated by hypermail 2.1.5 : Wed Aug 10 2005 - 14:16:07 CDT