GB18030 (was Re: FYI: Google blog on Unicode)

From: Michael D'Errico (mike-list@pobox.com)
Date: Mon Feb 08 2010 - 23:03:38 CST

    Can anyone point me to a reference for converting between GB18030
    and Unicode (in English)?

    Thanks,

    Mike

    Doug Ewell wrote:
    > Mark Davis ☸ wrote:
    >
    >> There are really two methodologies in question.
    >>
    >> 1. Accept the charset tagging without question.
    >> 2. Use charset detection, which uses a number of signals. The primary
    >> signal is a statistical analysis of the bytes in the document, but the
    >> charset tagging is taken into account (and can sometimes make a
    >> difference).
    >>
    >> The issue is which of these, on balance, produces better results
    >> for web pages and other documents. And with pretty exhaustive
    >> side-by-side comparisons of encodings, it is clear that #2 does,
    >> overwhelmingly.
    >
    > What about option 1½: Use charset detection, assisted by the charset
    > tagging. That is, if the content is valid UTF-8 or UTF-16, or something
    > else unambiguous like GB18030, ignore the tagging and trust the
    > detection algorithm fully. But if the algorithm shows that it could
    > reasonably be any of 8859-1 or -2 or -15, and it is tagged as 8859-2,
    > trust the tag. Just a thought.
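
    The "option 1½" rule quoted above could be sketched roughly as follows.
    This is only an illustration under stated assumptions: the function name
    choose_charset, the UNAMBIGUOUS list, and the idea that a detector hands
    back a ranked list of plausible charsets are all hypothetical and are not
    taken from any real detection library or from the thread.

```python
import codecs

# Charsets treated as "unambiguous" here; the list simply mirrors the
# examples named in the quoted message and is an assumption of this sketch.
UNAMBIGUOUS = ("utf-8", "utf-16", "gb18030")


def _canon(name):
    """Return Python's canonical codec name, or None for an unknown label."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def choose_charset(candidates, tag=None):
    """Pick a charset from detector output, using the tag only as a tie-breaker.

    candidates: charsets a (hypothetical) detector found plausible, best first.
    tag: the charset declared by the document or server, if any.
    """
    if not candidates:
        # The detector has no opinion, so fall back to the declared charset.
        return tag
    best = candidates[0]
    # Detection found something distinctive: ignore the tag entirely.
    if _canon(best) in {_canon(u) for u in UNAMBIGUOUS}:
        return best
    # Detection is undecided among several charsets: trust the tag,
    # but only if it names one of the plausible candidates.
    if tag is not None and _canon(tag) is not None:
        for candidate in candidates:
            if _canon(candidate) == _canon(tag):
                return candidate
    # The tag is absent, unknown, or implausible: keep the detector's best guess.
    return best
```

    For example, a detector that cannot decide among 8859-1, -2 and -15
    defers to a tag of 8859-2, while content detected as UTF-8 keeps that
    result even if it is tagged 8859-1.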


