Re: Parsing Unicode strings

From: Doug Ewell (dewell@roadrunner.com)
Date: Wed May 28 2008 - 22:06:09 CDT

Next message: Doug Ewell: "Re: Stateful?"

Previous message: Kenneth Whistler: "Re: Parsing Unicode strings"
In reply to: Peter Johansson: "Parsing Unicode strings"

Peter Johansson wrote:

> Is the Unicode-encoded character string self-descriptive? That is, do
> I need a priori knowledge that it is encoded as, for example, UTF-8
> rather than UTF-32? Or, by examining the first byte (or first few
> bytes) can I determine the encoding?

The approach taken in Appendix F of the XML specification
("Autodetection of Character Encodings") might be of interest:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing

An XML parser does have the distinct advantage in this case of knowing
what the first few "real" characters are supposed to be (normally
"<?xml"). The problem is harder to solve for arbitrary text, but not
unreasonably so, and in any case most text isn't completely arbitrary.
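
As a rough illustration of that kind of autodetection, here is a minimal
Python sketch. It checks for a byte order mark first, then falls back to
the byte patterns that "<?xml" or "<" would produce in each encoding
form. The function name sniff_encoding, the two tables, and the returned
labels are hypothetical names chosen for this example; they are not
taken from the XML specification or any library, and the sketch leaves
out the EBCDIC and unusual byte-order cases that the appendix also
lists.

```python
# Illustrative sketch only: names and return labels are assumptions.

# Byte order marks. The 4-byte UTF-32 forms are tested before the
# 2-byte UTF-16 forms because FF FE is a prefix of FF FE 00 00.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# With no BOM, an XML document is expected to start with "<?xml",
# so the first four bytes reveal the code unit width and byte order.
_XML_STARTS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "utf-8"),  # or any other ASCII-compatible encoding
]


def sniff_encoding(data: bytes):
    """Guess an encoding from the leading bytes of `data`.

    Returns (label, bom_length). The label is None when nothing
    matches, in which case the caller needs outside knowledge or
    a statistical heuristic.
    """
    for prefix, label in _BOMS:
        if data.startswith(prefix):
            return label, len(prefix)
    for prefix, label in _XML_STARTS:
        if data.startswith(prefix):
            return label, 0
    return None, 0
```

Note that FF FE 00 00 is ambiguous in principle: it could also be a
UTF-16LE BOM followed by U+0000. The sketch treats it as UTF-32LE on the
assumption that a leading NUL character is unlikely in real text. For
arbitrary non-XML text without a BOM, the function simply reports no
match.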

> I didn't see anything on this topic in the FAQ.

That does surprise me, considering the great deal of related information
on the "UTF-8, UTF-16, UTF-32 & BOM" page.

--
Doug Ewell  *  Arvada, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

