Re: Parsing Unicode strings

From: Doug Ewell (dewell@roadrunner.com)
Date: Wed May 28 2008 - 22:06:09 CDT


    Peter Johansson wrote:

    > Is the Unicode-encoded character string self-descriptive? That is, do
    > I need a priori knowledge that it is encoded as, for example, UTF-8
    > rather than UTF-32? Or, by examining the first byte (or first few
    > bytes) can I determine the encoding?

    The approach taken in Appendix F of the XML specification
    ("Autodetection of Character Encodings") might be of interest:

    http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing

    An XML parser does have the distinct advantage in this case of knowing
    what the first few "real" characters are supposed to be. The problem is
    harder to solve for arbitrary text, but not unreasonably so, and in any
    case most text isn't completely arbitrary.
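
    As a rough illustration of that approach, here is a minimal Python
    sketch, not a definitive implementation. The function name
    guess_encoding and its return convention are my own assumptions. It
    checks for a byte order mark first, then for the byte patterns that
    the XML autodetection appendix cited above lists for a document
    starting with "<?xml", and finally falls back to testing whether the
    whole buffer is well-formed UTF-8.

```python
import codecs

# Byte order marks, longest first, so that the UTF-32-LE mark
# (FF FE 00 00) is not mistaken for the UTF-16-LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# With no BOM, an XML parser can rely on the document starting with "<?xml".
# The last entry only shows an ASCII-compatible encoding; "utf-8" is a
# provisional guess until the encoding declaration has been read.
_XML_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "utf-8"),
]


def guess_encoding(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if nothing matches.

    Hypothetical helper: a heuristic guess, not a guarantee.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    for prefix, name in _XML_PREFIXES:
        if data.startswith(prefix):
            return name, 0
    # Arbitrary text: accept UTF-8 only if the whole buffer decodes cleanly.
    # Pure ASCII also passes this test, since ASCII is a subset of UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None, 0
    return "utf-8", 0


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
    encoding, skip = guess_encoding(sample)
    print(encoding, sample[skip:].decode(encoding))
```

    The sketch assumes the complete text is in hand, since a buffer cut
    off in the middle of a multi-byte sequence fails the UTF-8 check. It
    can also be fooled: UTF-16-LE text that begins with a BOM followed
    by U+0000 has the same leading bytes as the UTF-32-LE mark. For text
    with no BOM and no predictable opening characters, a real detector
    would need further heuristics.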

    > I didn't see anything on this topic in the FAQ.

    That does surprise me, considering the great deal of related information
    on the "UTF-8, UTF-16, UTF-32 & BOM" page.

    --
    Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

