Re: A basic question on encoding Latin characters

From: Mark E. Davis
Date: Mon Oct 04 1999 - 09:40:22 EDT

The only case I can think of where someone would get into trouble is where they
read in the text, convert that text to Normalization Form D, and only then
parse it. They could then run into a "<" that was not in the original text.
This would be a good test case for XML conformance.
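
A minimal sketch of that test case, assuming Python's standard unicodedata and
xml.etree.ElementTree modules stand in for the user agent. The sample document
and the two helper names are made up for illustration. U+226E (NOT LESS-THAN)
decomposes canonically to "<" followed by U+0338, so decomposing before
parsing puts a bare "<" into the markup:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample document: U+226E NOT LESS-THAN appears as plain content.
source = "<a>x \u226E y</a>"


def parse_then_normalize(doc):
    """Safe order: parse the markup first, then decompose the extracted text."""
    text = ET.fromstring(doc).text
    return unicodedata.normalize("NFD", text)


def normalize_then_parse(doc):
    """Risky order: decompose the raw source, then parse it."""
    decomposed = unicodedata.normalize("NFD", doc)
    try:
        return ET.fromstring(decomposed).text
    except ET.ParseError:
        # The decomposed source now contains a literal "<" in the content.
        return None


safe = parse_then_normalize(source)
unsafe = normalize_then_parse(source)
```

In the first order the parser never sees the decomposed "<"; in the second it
does, and the document is no longer well-formed.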

As to the other point you mention, take a look at the spec. Except in special
cases, an XML parser interprets the character reference &#xXXXX; exactly as
it would the same character expressed directly in the document's encoding.
Thus it doesn't matter whether you have U+0338 expressed as a character (e.g.
CC B8 in UTF-8) or as a numeric character reference (e.g. &#x0338;). In either
case, an XML parser will read the entity reference &lt; before it looks at the
following character, whether that character is a combining mark or not.
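
As a rough check of both points, here is a small sketch using Python's
xml.etree.ElementTree (my choice of parser for illustration, not anything the
spec mandates). The variable names are hypothetical. Both documents are UTF-8
bytes with no encoding declaration:

```python
import xml.etree.ElementTree as ET

COMBINING_SOLIDUS = "\u0338"  # COMBINING LONG SOLIDUS OVERLAY

# U+0338 written directly as its UTF-8 bytes (CC B8) after &lt;
literal_utf8 = b"<a>&lt;\xcc\xb8</a>"

# The same character written as a numeric character reference
char_ref = b"<a>&lt;&#x0338;</a>"

# In both cases the parser resolves &lt; first, then appends the
# combining mark, so the resulting text content is identical.
text_literal = ET.fromstring(literal_utf8).text
text_ref = ET.fromstring(char_ref).text
```

Both parses should yield the same two-character string, "<" followed by
U+0338, which is the combining sequence under discussion.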


Kevin Bracey wrote:

> "Mark E. Davis" wrote:
> > It is still not a problem. XML requires every instance of '<' where it
> > could be interpreted as the start of a tag to be quoted as "&lt;", so if
> > you wanted to use the combining character sequence it would have to be as
> > "&lt;&#x0338;". (actually, the second character doesn't need to be quoted
> > if the character set can express it).
> >
> What I meant was that it might be supplied in form C, but the user-agent
> might be decomposing everything on input internally, causing a problem.
> On your last point: surely you couldn't say &lt;<U+0338>, because that would
> be a semicolon with a slash through it in the source, no? It _would_ have
> to be &lt;&#x0338;. Or are we again searching only for base characters in
> the source, ignoring combining marks?
> --
> Kevin Bracey, Senior Software Engineer
> Pace Micro Technology plc Tel: +44 (0) 1223 518566
> 645 Newmarket Road Fax: +44 (0) 1223 518526
> Cambridge, CB5 8PB, United Kingdom WWW:
