Re: ZWJ&XML

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Sep 13 2006 - 03:38:52 CDT

    On Wed, 13 Sep 2006, Jose wrote:

    > Unicode Technical Report #20 (Unicode in XML and other Markup
    > Languages) http://www.Unicode.org/Unicode/reports/tr20/ specifies that
    > zero-width joiners/non-joiners (ZWJ and ZWNJ) are suitable for use
    > within the markup.

    Yes, for affecting ligature and joining behavior. I mention this because
    there is a popular word processor that uses ZWJ and ZWNJ quite
    inappropriately for line break control.

    Of course, the statement is of general nature: those characters are in
    principle suitable for use in marked-up text. It does not guarantee or
    prescribe that a particular markup system allows them or that they will be
    interpreted by their Unicode semantics.

    > But when an XML file with the tags written in Malayalam using ZWJs
    > (in Malayalam, ZWJ is used to form certain characters) is processed,
    > an error is reported that the tag contained an invalid character.

    Reported by which program? I first suspected that you may have tried to
    enter these characters but they do not appear correctly in the declared or
    implied character encoding.

    But reading again, I notice that you are referring to _tags_ and might
    actually mean the use of characters in element or attribute names, as
    opposed to their use in content between tags. UTR #20 discusses the
    latter, i.e. what you can use in document content proper - together with
    markup, not _inside_ markup (tags).

    The use of characters in element and attribute names is governed by the
    rules of each markup language, basically by its _identifier_ syntax.
    Generally, and in XML 1.0, format control characters are excluded in that
    syntax, and ZWJ and ZWNJ are format control characters by definition
    (General Category: Cf). Thus, an attempt to use them in element names
    would violate well-formedness constraints, and an XML parser would report
    an error - not about an invalid character per se but about a syntax error.
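
    The General Category mentioned above can be checked against the Unicode
    Character Database. The following is a minimal sketch, assuming Python's
    standard unicodedata module; the helper name describe() is made up for
    illustration. It shows that ZWJ and ZWNJ are in category Cf (format),
    whereas a C0 control such as U+0007 is in Cc.

```python
import unicodedata


def describe(ch):
    """Return (character name, General Category) for one character.

    Hypothetical helper for illustration. Characters without a name in
    the database (such as C0 controls) get the placeholder "<unnamed>".
    """
    return unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)


if __name__ == "__main__":
    # U+200D ZWJ, U+200C ZWNJ, and U+0007 (a C0 control) for contrast.
    for ch in ("\u200d", "\u200c", "\u0007"):
        name, category = describe(ch)
        print(f"U+{ord(ch):04X} {name}: {category}")
```

    This only reports the character properties; it says nothing about what
    a particular XML parser accepts in names.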

    In XML 1.1, ZWJ and ZWNJ are allowed in identifiers, but this is probably
    of little practical value.
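
    To illustrate the XML 1.1 rule, here is a rough sketch of a name check
    in Python. The code point ranges are transcribed from the NameStartChar
    and NameChar productions of the XML 1.1 Recommendation; the function
    names and the sample Malayalam name are assumptions made up for this
    example, and the check ignores namespace rules about colons.

```python
# Ranges from the XML 1.1 NameStartChar production.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional ranges that NameChar allows after the first character:
# "-" and ".", digits, U+00B7, combining marks, U+203F-U+2040.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def is_xml11_name_start_char(ch):
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_xml11_name_char(ch):
    return is_xml11_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_xml11_name(text):
    """Check text against Name ::= NameStartChar (NameChar)*."""
    if not text:
        return False
    return is_xml11_name_start_char(text[0]) and all(
        is_xml11_name_char(ch) for ch in text[1:]
    )


if __name__ == "__main__":
    # Hypothetical element name: Malayalam letters followed by
    # virama (U+0D4D) and ZWJ (U+200D).
    sample = "\u0d05\u0d35\u0d28\u0d4d\u200d"
    print(is_xml11_name(sample))
```

    Whether a given parser actually implements XML 1.1 is a separate
    question, which is part of why the practical value is limited.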

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

