Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 14:14:30 CDT


    From: "Tom Emerson" <tree@basistech.com>
    > This happens *all* the time. I constantly encounter pages that are
    > labeled as ISO-8859-1 (actually usually CP1252) and indeed, if you
    > just look at the byte values, are valid Latin 1 (or even just
    > US-ASCII). However, the content is encoded in HTML escapes, and is
    > actually Arabic or Persian. Hence you have to do the detection in a
    > couple of steps, since the presence of these entities (remember, an
    > X?HTML page can include any character regardless of the declared
    > "primary" encoding) opens up all of Unicode.

    This is absolutely not needed for a charset detector (i.e. the detection of
    the encoding used to serialize the text). HTML escapes are perfectly valid
    in HTML, and even if they refer to non-Latin-1 characters, this does not
    change the fact that the page remains encoded in ISO-8859-1.

    You don't need to take HTML escapes into account when deciding which
    encoding is used, because these escapes are independent of the actual
    encoding used.
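
    For example (a small Python sketch, not part of any real detector, and the
    fragment is just made up): a page full of numeric references for Arabic
    letters still serializes to pure US-ASCII bytes, so a byte-level detector
    sees exactly the same data whatever legacy charset is declared:

        # The escapes below spell an Arabic word, but they never leave the
        # US-ASCII range, so the declared charset does not affect the bytes.
        fragment = '<p>&#1605;&#1585;&#1581;&#1576;&#1575;</p>'

        for enc in ('iso-8859-1', 'windows-1252', 'utf-8'):
            print(enc, fragment.encode(enc))
        # All three lines show the same byte sequence.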

    With only one exception: some HTML escapes like "&#128;" or "&#x80;" are
    used and normally refer to the first C1 control, independently of the
    encoding used. So an HTML renderer should render this C1 control, but it
    is normally invalid in HTML text, which restricts the subset of Unicode
    characters it may contain (the only acceptable controls are CR, LF and
    TAB). Some browsers like IE ignore this kind of error and instead attempt
    to substitute the codepoint that is invalid in HTML with another codepoint
    that is acceptable in the HTML subset.
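
    To make that restriction concrete, here is a hypothetical little checker
    (Python, just a sketch of the rule, not anything a browser actually uses):

        # Among control codes, only TAB, LF and CR are acceptable in HTML text;
        # the other C0 controls, DEL and the C1 range (which "&#128;" points
        # into) are not.
        def is_valid_html_char(cp):
            if cp in (0x09, 0x0A, 0x0D):          # TAB, LF, CR
                return True
            if cp <= 0x1F or 0x7F <= cp <= 0x9F:  # other C0, DEL, C1 controls
                return False
            return True

        print(is_valid_html_char(0x80))    # False: the target of "&#128;"
        print(is_valid_html_char(0x20AC))  # True: the Euro sign itself is fine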

    In this case, it will typically convert the invalid codepoint as if it were
    a code in a Windows codepage, so here it will render the Euro symbol. This
    kind of substitution is based on the effective legacy charset used to
    encode the page: if the page is encoded with ISO-8859-1 or Windows-1252,
    IE will map codepoint 128 to the Euro symbol as defined in Windows-1252.
    This sort of autocorrection is quite common, but the page is indeed not
    valid HTML.
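
    The kind of substitution described above amounts to something like this
    (Python, a rough sketch only; the function name is mine, and IE's real
    code certainly does not look like this):

        # Reinterpret a C1 codepoint as a byte of the page's legacy Windows
        # codepage and return the character that byte maps to there.
        def substitute_c1(cp, codepage='windows-1252'):
            if 0x80 <= cp <= 0x9F:
                try:
                    return bytes([cp]).decode(codepage)
                except UnicodeDecodeError:
                    return '\uFFFD'   # byte undefined in that codepage
            return chr(cp)

        print(substitute_c1(128))   # '€': "&#128;" rendered as the Euro sign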

    If the page is encoded with UTF-8 or UTF-16, the reference "&#128;" is not
    remapped and remains associated with the C1 control. In that case, the
    character will either not be rendered or will be rendered as a square box,
    depending on the font used; or, if a non-Unicode font is used, the
    codepoint is rendered using the code position of the glyph in that legacy
    font. There are various tricks used there, but it seems this is done to
    preserve compatibility with texts using legacy charsets and legacy fonts
    for which not all characters are mapped to Unicode. I don't know how IE
    manages it internally, but this seems like a renderer-specific issue where
    non-Unicode characters can be rendered even though they are normally
    invalid in strict HTML. The actual algorithm used to render these invalid
    characters may be even more complex when you consider the special case of
    "Symbol" fonts (with their specific code positions that are mapped to
    Unicode with a constant offset).
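
    If I remember that offset correctly (0xF000, into the Private Use Area, on
    Windows), the Symbol-font trick amounts to something like this small
    sketch (the names are mine):

        # Expose the 8-bit code positions of a Symbol-like font at a fixed
        # Private Use Area offset instead of mapping them to real characters.
        SYMBOL_FONT_OFFSET = 0xF000   # assumed Windows convention

        def symbol_position_to_unicode(pos):
            # pos is the 0..255 code position of the glyph in the legacy font
            return SYMBOL_FONT_OFFSET + pos

        print(hex(symbol_position_to_unicode(0x61)))  # 0xf061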


