Re: [OT] HTML charset declarations (was: GSM and Unicode)

Date: Wed Nov 05 2003 - 15:36:05 EST


    Ok, let's forget about the HTML discussion and let's talk about XML:

    In a message dated 11/5/2003 12:11:21 PM Pacific Standard Time, writes:

    One can however use it safely with XHTML, because XHTML documents are XML
    documents which may specify explicitly another document schema that includes
    this extra attribute (thanks to the modular model of XHTML). But you'll have
    to provide your own XML schema...
    Hmm... not quite the same. Be careful here. It depends on what MIME type you
    use in the Content-Type for your XHTML.
    You need to read the following two documents carefully:
    1. RFC 3023 - XML Media Types
    2. XHTML Media Types

    Note that for XHTML, which must be a valid XML document, UTF-8 is the
    default if nothing is specified.
    Not true. According to XHTML Media Types, if you are using
    "application/xhtml+xml" or "application/xml" for your XHTML, then UTF-8 is
    the default if nothing is specified. However, if you use "text/xml" as your
    Content-Type in the header, the default is different. Read the following
    text from RFC 3023 - XML Media Types:
    [begin of quote]
    3.6 Summary

       The following list applies to text/xml, text/xml-external-parsed-
       entity, and XML-based media types under the top-level type "text"
       that define the charset parameter according to this specification:

       o Charset parameter is strongly recommended.

       o If the charset parameter is not specified, the default is
          "us-ascii". The default of "iso-8859-1" in HTTP is explicitly
          overridden.

       o No error handling provisions.

       o An encoding declaration, if present, is irrelevant, but when
          saving a received resource as a file, the correct encoding
          declaration SHOULD be inserted.
    [end of quote]

    Notice that it says not only that "us-ascii" is the default if there is no
    charset parameter in the HTTP Content-Type header. It ALSO says that an
    "encoding declaration" (that means <?xml encoding="..."?>), "if present, is
    irrelevant". (Surprise :) )
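
The rules in the quoted summary can be sketched in a few lines of Python. This is only an illustration of how I read RFC 3023, not a conforming implementation: the function name effective_charset and the regular expression are made up for this example, and BOM handling is deliberately left out.

```python
import re

# Hypothetical helper, not part of any library: a rough pattern for the
# encoding pseudo-attribute of a leading XML declaration.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def effective_charset(content_type: str, body: bytes) -> str:
    """Guess the charset a receiver should use for an XML entity sent over HTTP."""
    # Split "type/subtype; charset=..." into the media type and its parameters.
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower()
    # An explicit charset parameter in the header wins.
    if charset:
        return charset
    # text/xml and friends: the default is us-ascii, and the encoding
    # declaration inside the document is irrelevant.
    if media_type.startswith("text/"):
        return "us-ascii"
    # application/xml, application/xhtml+xml: fall back to the document's own
    # encoding declaration, then to UTF-8 (BOM detection omitted here).
    match = XML_DECL_ENCODING.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"
```

With this sketch, a "text/xml" response that carries no charset parameter comes out as us-ascii even when the document itself declares encoding="UTF-8", which is exactly the surprise described above.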

    But the XML declaration may be added on top
    to specify the charset to use when parsing the XML document. In that case,
    the XML declaration in the document takes precedence over the external HTTP
    header, which itself takes precedence over the <meta http-equiv /> elements.
    That is not what RFC 3023 says. Actually, RFC 3023 says such an XML
    declaration has no effect on a "text/xml" document received over HTTP.

    So if you want full XML compliance and support for legacy browsers, you need

    The first thing that needs to be done: add charset=UTF-8 to the HTTP
    Content-Type header itself if you are using "text/xml". The other approach
    is to use a non-"text" MIME Content-Type.
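
As a minimal sketch of the first approach, here is a tiny WSGI application in Python that sends the charset in the HTTP header itself. The name app and the sample document are assumptions for illustration only; any server-side mechanism that lets you set the Content-Type header would do the same job.

```python
def app(environ, start_response):
    """Hypothetical WSGI app: serve a small XML document as text/xml in UTF-8."""
    document = '<?xml version="1.0" encoding="UTF-8"?>\n<greeting>café</greeting>\n'
    body = document.encode("utf-8")
    headers = [
        # The charset parameter is stated explicitly, so the us-ascii default
        # for text/xml never comes into play.
        ("Content-Type", "text/xml; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("localhost", 8000, app) can serve it. Switching the header value to "application/xml" would be the second approach mentioned above.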

        - use a leading <?xml ?> declaration with the explicit charset
    Not a bad idea to do it anyway.

        - declare the <!DOCTYPE > with your own schema, and make this extended
    schema accessible at the referenced SYSTEM url, and give it a specific
    PUBLIC doctype name.

        - use a <meta http-equiv /> tag very soon in your <head> section, even
    before any possibly internationalized string like the <title></title>
    element (in fact it is recommended to put ALL <meta http-equiv /> elements
    before the required <title></title> element and then only put the other
    <meta name /> elements such as robots control tags, description and

        - avoid all line breaks within <meta http-equiv /> elements (needed for
    some web servers tuned for performance and that can parse lazily the HTML
    document before generating HTTP headers), unless you can control the
    generation of HTTP headers (with an external server control file like
    .httpd.conf or similar features, or if you generate headers yourself within
    a server-side script)
    No clue why you need this.

        - make sure you insert a space before all abbreviated elements
    terminators "/>"

        - always specify explicitly the "iso-8859-1" document charset with the
    above method, if this is the one you use, as the default charset differs
    between HTML (which defaults to ISO-8859-1) and XHTML (which defaults to
    UTF-8, per XML conformance, unless there's a leading BOM to specify UTF-16
    or UTF-32)
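
The differing defaults in the last item can be sketched as follows. This is a rough Python illustration under stated assumptions: the helper name default_charset and the is_xml flag are invented for the example, and it only covers the case where no charset is given anywhere (no HTTP charset parameter, no XML declaration, no <meta http-equiv /> element).

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the UTF-32 marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def default_charset(body: bytes, is_xml: bool) -> str:
    """Hypothetical helper: charset assumed when nothing at all is declared."""
    if not is_xml:
        # HTML served over HTTP without any charset information.
        return "iso-8859-1"
    # XML (including XHTML treated as XML): a leading BOM selects the
    # encoding, otherwise the XML default of UTF-8 applies.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return "utf-8"
```

Note that for XHTML sent as "text/xml" the us-ascii default from RFC 3023 discussed earlier would apply instead, so this sketch is only meaningful for the "application/xml" and "application/xhtml+xml" cases.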

    Frank Yung-Fong Tang
    System Architect, Itrntinl Dvlpmet, AOL Intrtv Srvies
    AIM:yungfongta Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan

    John 3:16 "For God so loved the world that he gave his one and only Son, that
    whoever believes in him shall not perish but have eternal life.
