Re: A basic question on encoding Latin characters

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Mon Oct 04 1999 - 09:40:22 EDT

The only case I can think of where someone would get into trouble is where they
read in the text, converted that text to Form D, and only then parsed the text.
They could then run into a "<" that was not in the original text. This would be
a good test case for XML conformance.
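As a rough illustration of that failure mode, here is a minimal sketch. It assumes Python's standard unicodedata and xml.etree.ElementTree modules stand in for the user agent's normalizer and parser; the message itself names no particular implementation, and the sample document and the helper name `parses` are made up for the example. It relies on U+226E NOT LESS-THAN decomposing canonically to "<" followed by U+0338.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample document containing U+226E NOT LESS-THAN in Form C.
original = "<a>x \u226e y</a>"

def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Wrong order: decompose the raw text first, then parse.
# The decomposition introduces a bare "<" that was not in the original.
decomposed = unicodedata.normalize("NFD", original)

# Safe order: parse first, then normalize the character data.
parsed_first = unicodedata.normalize("NFD", ET.fromstring(original).text)

print(parses(original))    # the original is well-formed
print(parses(decomposed))  # the decomposed copy is rejected
print(ascii(parsed_first))
```

Parsing before normalizing keeps the new "<" inside character data, where it can no longer be mistaken for markup.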

As to the other point you mention, take a look at the spec. Except in special
cases an XML parser will interpret the character reference &#xXXXX; exactly as
it would a character expressed in the character set. Thus it doesn't matter
whether you have U+0338 expressed as a character (e.g. CC B8 in UTF-8) or are
using a numeric entity (e.g. &#x0338;). And in either case, an XML parser will
read the entity reference &lt; before the following character, whether that
character is a combining mark or not.
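The equivalence of the two spellings can be checked with a short sketch. It again assumes Python's standard xml.etree.ElementTree as a stand-in parser, which the message does not mention, and the element name `a` is arbitrary.

```python
import xml.etree.ElementTree as ET

# U+0338 written directly as a character after the &lt; entity reference.
literal = "<a>&lt;\u0338</a>"

# U+0338 written as a numeric character reference instead.
reference = "<a>&lt;&#x0338;</a>"

text_literal = ET.fromstring(literal).text
text_reference = ET.fromstring(reference).text

# Both documents yield the same two-character content: "<" then U+0338.
print(text_literal == text_reference)

# The UTF-8 encoding of U+0338 is the byte pair CC B8.
print("\u0338".encode("utf-8").hex())
```

In both cases the parser resolves &lt; on its own, so the combining mark that follows cannot attach to anything in the markup.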

Mark

Kevin Bracey wrote:

> In message <37F4E501.5A1E89CC@ispchannel.com>
> "Mark E. Davis" <markdavis@ispchannel.com> wrote:
>
> > It is still not a problem. XML requires every instance of '<' where it
> > could be interpreted as the start of a tag to be quoted as "&lt;", so if
> > you wanted to use the combining character sequence it would have to be as
> > "&lt;&#x0338;". (actually, the second character doesn't need to be quoted
> > if the character set can express it).
> >
>
> What I meant was that it might be supplied in form C, but the user-agent
> might be decomposing everything on input internally, causing a problem.
>
> On your last point; surely you couldn't say &lt;<U+0338>, because that would
> be a semicolon with a slash through it in the source, no? It _would_ have
> to be &lt;&#x0338;. Or are we again searching only for base characters in
> the source, ignoring combining marks?
>
> --
> Kevin Bracey, Senior Software Engineer
> Pace Micro Technology plc Tel: +44 (0) 1223 518566
> 645 Newmarket Road Fax: +44 (0) 1223 518526
> Cambridge, CB5 8PB, United Kingdom WWW: http://www.acorn.co.uk/


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT