Re: Special characters

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Wed Nov 06 2002 - 06:18:14 EST


    Hello,

    I had written:

    > HTML:
    > · Store your entire page in UTF-8, [...]
    > · Store your entire page in a suitable standard codepage, cf.
    > <http://czyborra.com/charsets/iso8859.html>, [...]
    > · Store your page in some standard CP (as above), and enter the
    > particular problem characters as NCRs, [...]

    Edward H Trager wrote:
    > Even though they are second and third options in your email response,
    > are you sure you want to implicitly encourage someone to use CODEPAGES
    > instead of UTF-8 on their web pages? This is not good advice, I fear.

    I was explicitly referring to "standard codepages", and I included
    a link to a description of the ISO 8859 series. I did not mean to
    advocate throwing HTML in proprietary encodings at poor, unsuspecting
    browsers...

    Of course, UTF-8 is the way to go for newly designed, international web
    pages. However, there may be situations where you are forced to use particular
    encodings, so I thought I should mention the possibility.

    > One of the biggest headaches I have is trying to read web pages written in
    > certain code pages that don't appear correctly under various browsers on
    > my non-Windows workstations (maybe it's a problem on Windows too, I just
    > haven't checked): if those pages had been in UTF-8, then very likely they
    > would at least be readable.

    It would be interesting to know more particulars:

    - Are you sure that the pages causing your headache were properly tagged
       with the charset?

       I have seen many HTML pages (and e-mail, btw.) encoded in MS CP 1252
       (cf. <http://czyborra.com/charsets/codepages.html#CP1252>) but tagged
       as ISO-8859-1, or even as ASCII; cf. an example in my e-mail FAQ at
       <http://www.systems.uni-konstanz.de/EMAIL/FAQ.php#SMTP-71>.

    - Which CP cannot be properly handled by which browser/OS combo?
       Have you seen anything beyond the findings of Alan Wood, cf.
       <http://www.alanwood.net/unicode/browsers.html>?

       I guess that the ISO 8859 series' encodings will be handled by
       any browser on any system (if correctly configured and supplied
       with suitable fonts) -- but I never had the time and resources
       to test this conjecture.
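
    The CP 1252 vs. ISO-8859-1 mislabelling mentioned above can be
    illustrated with a small Python sketch. It relies on the fact that
    bytes 0x80-0x9F are printable characters in CP 1252 but C1 control
    codes in ISO-8859-1. The helper name looks_like_cp1252 and the sample
    string are made up for illustration; the check is only a heuristic,
    not a reliable detector.

```python
def looks_like_cp1252(raw: bytes) -> bool:
    """Heuristic: True if raw contains bytes in 0x80-0x9F.

    ISO-8859-1 maps that range to C1 control codes, so such bytes in a
    page tagged ISO-8859-1 (or ASCII) hint that it was really written
    in CP 1252.
    """
    return any(0x80 <= b <= 0x9F for b in raw)


# Hypothetical sample: curly quotes and a euro sign, saved as CP 1252.
original = "\u201csmart quotes\u201d cost 5 \u20ac"
sample = original.encode("cp1252")

as_cp1252 = sample.decode("cp1252")      # what the author meant
as_latin1 = sample.decode("iso-8859-1")  # what the wrong tag yields

print(looks_like_cp1252(sample))         # suspicious bytes present?
print(repr(as_latin1))                   # quotes and euro become C1 controls
```

    A page that contains only characters from the common part of both
    encodings (for example, plain accented Latin letters) decodes
    identically either way, which is why such mislabelling often goes
    unnoticed until curly quotes or a euro sign appear.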

    A popular browser, Netscape Navigator, versions 3 through 4.8, does
    not handle NCRs according to the HTML 4 specification. Alan Wood
    describes this behaviour thus:
    : Numeric character references [...] are supposed be displayed
    : independently of the document's character encoding, but Naviga-
    : tor 4.8 is restricted to the numeric character references that
    : fall within the current encoding (either specified in a meta tag
    : or selected from the View menu). It is normally necessary to select
    : the Unicode (UTF-8) character set from the View menu in order to
    : force numeric character references to be displayed properly.
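
    For comparison, the behaviour HTML 4 asks for can be sketched in
    Python: the bytes are first decoded with the declared charset, and
    the NCRs are then resolved as Unicode code points, regardless of that
    charset. The sample bytes and the helper fits_encoding are
    assumptions made for illustration; html.unescape is used here merely
    as a stand-in for a browser's reference handling.

```python
import html


def fits_encoding(ch: str, encoding: str) -> bool:
    """Return True if ch can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical page fragment, stored in ISO-8859-1: the e-acute is a raw
# byte (0xE9), while the euro sign is written as the NCR &#8364;.
raw = b"Price: 5 &#8364; (caf\xe9)"

text = raw.decode("iso-8859-1")   # step 1: apply the declared charset
rendered = html.unescape(text)    # step 2: resolve NCRs as code points

print(rendered)
print(fits_encoding("\u20ac", "iso-8859-1"))  # euro is outside ISO-8859-1
```

    The euro sign comes out correctly although ISO-8859-1 has no code for
    it. Navigator 4.x, as described in the quotation, effectively refuses
    step 2 for characters outside the current encoding.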

    The HTML author can easily circumvent this problem via a variant
    of my 3rd alternative, viz.
    · Store your page in ASCII (i.e. 7-bit only!), and enter every
       non-ASCII character as an NCR; but tag your page as UTF-8.

    This works well, as ASCII is a proper subset of UTF-8. This scheme
    is feasible for content that is largely in Latin script, with
    occasional national/special characters. It does not require any
    advanced software on the author's side: a simple 8-bit editor, such
    as Notepad from Windows 95, will suffice. NN 4.7/4.8 will happily
    display the characters produced via the NCRs; so will all browsers
    capable of displaying UTF-8-encoded HTML 4.
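
    Where a scripting language is at hand, the ASCII-plus-NCR scheme can
    also be produced mechanically. The following is a minimal Python
    sketch, not part of the scheme itself: the function name
    to_ascii_with_ncrs and the sample text are hypothetical, and the meta
    tag is just one way of tagging the page as UTF-8.

```python
def to_ascii_with_ncrs(text: str) -> bytes:
    """Encode text as 7-bit ASCII, writing every non-ASCII character
    as a decimal numeric character reference (&#NNN;)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical content: mostly Latin script with a few special characters.
body = to_ascii_with_ncrs("Gr\u00fc\u00dfe aus Konstanz, 5 \u20ac")

# The page is pure ASCII but is tagged as UTF-8.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8"></head><body><p>'
    + body
    + b"</p></body></html>"
)

print(body.decode("ascii"))
```

    Since the result contains only 7-bit bytes, decoding it as ASCII or
    as UTF-8 gives the same text. Note that this error handler does not
    escape markup characters such as & or <; those still have to be
    treated separately.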

    Best wishes,
       Otto Stolz


