RE: Subj: Unicode form field validation in javascript

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 01 2007 - 13:50:59 CDT


    knez.dusan@gmail.com wrote:
    > $legalChars = "/\p{L}|\p{Pc}|\p{N}/"; // check for letters, numbers and underscores
    > $legalCharsCount = preg_match_all($legalChars,$strng,$blb);
    > $illegalCharsCount = mb_strlen($strng,"UTF-8") - $legalCharsCount;
    >
    > I wonder, how to implement the similar javascript validation with regular
    > expressions on client side.

    Your test basically just verifies that the string contains no character
    that fails to match your regexp. It would be better written with an
    explicit set:
    "/[\p{L}\p{Pc}\p{N}]/"

    Note however that your expression will not allow people to enter their
    names the way they want them written, because it excludes combining
    characters (for example Hebrew and Arabic vowel points, or the dependent
    vowel signs of the Indic abugidas).

    I think you will also receive complaints from people whose names contain
    apostrophes (don't assume they are necessarily the "closing", right-hand
    form), hyphens, or the whitespace needed in compound names. Here you are
    only accepting letters, connector punctuation, and digits/numbers, but no
    combining diacritic, no hyphen, no apostrophe, and no space. For some
    languages you'll need other "punctuation" marks as well, because in the
    native language they are not used as punctuation but as part of the
    orthographic system, as if they were letters.

    If you want to make sure that people's names will be accepted correctly,
    you may first look at the list of languages you want to support, and then
    at the list of characters each one needs, which you can find in the many
    UDHR texts now present on the Unicode site: each text comes with a list of
    the characters it uses.

    Javascript does not have built-in support for this level of regular
    expressions. However, it does support regular expressions that can be
    built using a simpler syntax for the set of allowed characters.
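
    As a rough sketch of that simpler syntax: the ranges below are only an
    illustration (ASCII letters and digits, the underscore, and the Latin-1
    and Latin Extended-A letters), not an equivalent of \p{L}, \p{N} and
    \p{Pc}. The names legalChars and isLegal are made up for this example.

```javascript
// Illustrative ranges only: 0-9, A-Z, a-z, "_", and the letters in
// U+00C0..U+017F (skipping the multiplication and division signs).
// This is NOT the full set matched by \p{L}, \p{N} and \p{Pc}.
var legalChars = /^[0-9A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F]*$/;

// Returns true when every UTF-16 code unit of text is in the set above.
function isLegal(text) {
  return legalChars.test(text);
}
```

    With these ranges, isLegal("Du\u0161an_1") is true, while
    isLegal("O'Brien") is false because the apostrophe is not in the set.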

    But note also that Javascript internally handles strings encoded in
    UTF-16, not UTF-8 as you are doing in PHP on the server side. This means
    that the set must be described using only UTF-16 code units. Note also
    that Javascript does not enforce code point boundaries in the UTF-16
    encoding (so characters outside the BMP that are accepted by your regexp
    would not be accepted by a Javascript RegExp, where they would be seen
    only as separate surrogates).
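
    To illustrate the code-unit view, here is a small sketch using U+10400
    (DESERET CAPITAL LETTER LONG I), a letter outside the BMP. The variable
    names and the BMP-only character class are assumptions chosen for the
    example.

```javascript
// U+10400 written as its UTF-16 surrogate pair.
var s = "\uD801\uDC00";

// Collect the code units the string really contains, in hexadecimal.
var units = [];
for (var i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}
// s.length is 2 and units is ["d801", "dc00"]: one letter, two code units.

// A character class written with BMP ranges only sees the two surrogates,
// so this letter is rejected even though \p{L} would accept it.
var bmpLetters = /^[A-Za-z\u00C0-\u024F]+$/;
var accepted = bmpLetters.test(s); // false
```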

    So I suggest you initialise a constant array containing the ranges of
    accepted characters. This can be done with a single array of Javascript
    integers (each even index stores the start of an accepted range, each odd
    index stores the start of a rejected range), and then using a simple
    dichotomic (binary) search for each code point to check.
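
    A minimal sketch of such a table and its search follows. The boundary
    values are a small illustrative subset, not the real \p{L}, \p{N} and
    \p{Pc} ranges, and the names BOUNDARIES and isAccepted are invented for
    the example.

```javascript
// Sorted boundaries: even index = first code point of an accepted range,
// odd index = first code point of the rejected range that follows it.
// Illustrative subset only, not the full Unicode property data.
var BOUNDARIES = [
  0x30, 0x3A,   // 0-9
  0x41, 0x5B,   // A-Z
  0x5F, 0x60,   // _
  0x61, 0x7B,   // a-z
  0xC0, 0xD7,   // U+00C0..U+00D6
  0xD8, 0xF7,   // U+00D8..U+00F6
  0xF8, 0x180   // U+00F8..U+017F
];

// Binary search counting how many boundaries are <= cp. An odd count
// means the last boundary passed opened an accepted range.
function isAccepted(cp) {
  var lo = 0, hi = BOUNDARIES.length;
  while (lo < hi) {
    var mid = (lo + hi) >>> 1;
    if (BOUNDARIES[mid] <= cp) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return (lo & 1) === 1;
}
```

    Adding a range only means inserting two integers at the right place in
    the table; the search itself does not change.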

    You'll also need a simple loop to scan the Javascript string, detect
    surrogates, and combine them in pairs to convert them into code points.
    If a surrogate is unpaired, you can return a false value from your test
    function to signal that the string contains invalid Javascript
    "characters". Note that Javascript characters are not the same as Unicode
    code points or abstract characters: Javascript strings are just vectors
    of 16-bit code units (a Javascript string can safely store invalid code
    points, and in fact any sequence of code units, in any order, in the full
    range 0x0000 to 0xFFFF).
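
    The scanning loop could look roughly like this. It is a sketch only: the
    names toCodePoints and validate are hypothetical, and validate takes the
    per-code-point test as a parameter so that any range check can be plugged
    in.

```javascript
// Converts a string to an array of code points, or returns false if it
// contains an unpaired surrogate.
function toCodePoints(text) {
  var out = [];
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      var next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) {
        return false;
      }
      out.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
      i++; // skip the low surrogate just consumed
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no high surrogate before it.
      return false;
    } else {
      out.push(unit);
    }
  }
  return out;
}

// Returns true only if the string is well-formed UTF-16 and every code
// point passes the supplied isAccepted(cp) test.
function validate(text, isAccepted) {
  var cps = toCodePoints(text);
  if (cps === false) {
    return false;
  }
  for (var i = 0; i < cps.length; i++) {
    if (!isAccepted(cps[i])) {
      return false;
    }
  }
  return true;
}
```

    For example, toCodePoints("A\uD801\uDC00") gives [0x41, 0x10400], while
    toCodePoints("\uD801") returns false.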

    However, it seems difficult to enter such invalid text in a browser,
    which must validate input for XML conformance, even if a Javascript
    string could store it (in other words, not all Javascript strings can be
    assigned to or retrieved from an HTML/XML-bound input object). But I
    would not bet on it, because of possible bugs in browsers, or limitations
    in their editors that may allow entering text that is invalid for Unicode
    or for XML/HTML transmission, even though it may be present in the local
    input object.

    The implementation is then simple if your regular expression is fixed: it
    just requires an equivalent initialised constant vector of integers.


