Re: Normalization Implementation Tricks

From: Martin v. Löwis (martin@v.loewis.de)
Date: Mon Feb 16 2009 - 03:04:46 CST


    > The most naive implementation would be a table indexed by a starter
    > character and a combining character. Of course that is completely
    > unreasonable as it would require 0x110000 * 0x110000 entries (a few
    > terabytes).
    >
    [...]
    > Are there any other implementation methods that have a small memory
    > footprint (~10-20kb) and quick access (~ 10-20 instructions)? Any
    > guidance in this regard would be appreciated.

    In Python, we observe that there are 366 starter characters and
    52 combining characters. So if a code point is not a starter, or
    the character following it is not a combiner, you can proceed
    without further work. Otherwise, you have to consider the 366*52
    possible combinations. We put them into a trie, with 4956 bytes
    for the index and 8884 bytes for the data.
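
    Here is a minimal sketch of that scheme in Python. The
    three-starter, three-combiner subset and the names STARTERS,
    COMBINERS, and COMPOSE are mine for illustration; the real tables
    are much larger and trie-compressed:

    # A toy subset standing in for the real data; the real lists have
    # 366 starters and 52 combiners.
    STARTERS = [0x0041, 0x0045, 0x004F]    # A, E, O (sorted)
    COMBINERS = [0x0300, 0x0301, 0x0308]   # grave, acute, diaeresis (sorted)

    # Flat 3*3 table addressed by (starter index, combiner index);
    # 0 means "this pair does not compose".
    COMPOSE = [0] * (len(STARTERS) * len(COMBINERS))
    COMPOSE[0 * 3 + 0] = 0x00C0   # A + grave     -> À
    COMPOSE[0 * 3 + 1] = 0x00C1   # A + acute     -> Á
    COMPOSE[2 * 3 + 2] = 0x00D6   # O + diaeresis -> Ö

    def compose(starter, combiner):
        """Return the composed code point, or None if the pair
        does not compose."""
        # Fast path: almost every pair of code points exits here.
        if starter not in STARTERS or combiner not in COMBINERS:
            return None
        s = STARTERS.index(starter)     # linear search, as in the post
        c = COMBINERS.index(combiner)
        return COMPOSE[s * len(COMBINERS) + c] or None

    assert compose(0x0041, 0x0300) == 0x00C0   # A + grave -> À
    assert compose(0x0045, 0x0308) is None     # no entry in the toy table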

    To do a lookup, you need to compute the index of the starter and
    the index of the combiner, which we do with a linear search
    (although a binary search would probably be faster; a trie would
    be faster still).
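
    That binary search is a few lines with the standard bisect
    module; a sketch against the same toy starter list as above
    (starter_index is an illustrative helper, not actual CPython
    code):

    import bisect

    STARTERS = [0x0041, 0x0045, 0x004F]   # sorted toy subset, as above

    def starter_index(cp):
        """Binary-search the sorted starter list; return the index
        of cp, or -1 if cp is not a starter."""
        i = bisect.bisect_left(STARTERS, cp)
        return i if i < len(STARTERS) and STARTERS[i] == cp else -1

    assert starter_index(0x0045) == 1    # E is the second starter
    assert starter_index(0x0042) == -1   # B is not a starter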

    Regards,
    Martin


