RE: Limits in UBA

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Wed, 22 Oct 2014 19:18:38 +0000

Eli,

> > Embeddings are common in generated text. The guiding principle, is
> > seemingly, when in doubt wrap the string in an embedding. At the UTC, we
> > heard, that this can lead to very deep stacks - but I've never actually seen
> > one with more than 63 levels - but that is not my topic here.
>
> I'd appreciate some pointers to such texts, if they are publicly
> accessible. I'd be very interested to see why such deep embeddings
> are necessary.

They aren't necessary for human-generated text. No ordinary text meant
for human reading needs them. But as Andrew indicated, the problem arises from
the potential for automated injection of text wrapped in an embedding.
There is no expectation that any of that would actually be readable
text in most cases. But on the other hand, the generated text could
end up in logs or other text stores which, in turn, could end up processed
by some text rendering for display in a window somewhere. You don't
then want an arbitrarily low limit for handling embeddings in the UBA to
suddenly crap out the display: that just leads to bug reports and a lot
of confused thrashing up and down the customer support chain.

An example I could think of off the top of my head might involve some
complicated database application working with Arabic data. If the
mechanism generating some automated queries was automatically
encapsulating string literals in the "where db103.tbl246.col27='blah'"
qualifiers *and* the query was encapsulating each full "select xxx"
statement *and* the query was using nested subqueries, then each
subquery could add two levels of embedding. If the generation of the
query ended up nesting 32 subqueries (which can occur, although it might
not be good practice), that is 64 nested embeddings, and
you would already have bumped over the prior 61-level embedding
limit for UBA.
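
A rough sketch of that arithmetic, for anyone who wants to poke at it. It
assumes a simplified reading of the explicit-embedding rules in UAX #9 (only
LRE, RLE and PDF; no overrides, no isolates), and the function name, the
"blah" payload and the alternating RLE/LRE pattern are made up for
illustration. The depth limit is a parameter: 125 is the current max_depth,
and 61 is the pre-6.3 value as I understand it (the figure quoted in this
thread is 63, and the outcome is the same in kind either way).

```python
LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"

def max_embedding_level(text, max_depth, paragraph_level=0):
    """Return (deepest level reached, number of embedding openers ignored).

    Simplified sketch: handles LRE/RLE/PDF only, with an overflow counter
    for openers that would exceed max_depth.
    """
    stack = [paragraph_level]
    overflow = 0   # ignored openers that are still waiting for their PDF
    ignored = 0    # total openers ignored so far
    deepest = paragraph_level
    for ch in text:
        if ch in (LRE, RLE):
            cur = stack[-1]
            if ch == RLE:
                new = cur + 1 if cur % 2 == 0 else cur + 2  # least odd level > cur
            else:
                new = cur + 2 if cur % 2 == 0 else cur + 1  # least even level > cur
            if new <= max_depth and overflow == 0:
                stack.append(new)
                deepest = max(deepest, new)
            else:
                overflow += 1
                ignored += 1
        elif ch == PDF:
            if overflow > 0:
                overflow -= 1      # closes an ignored opener
            elif len(stack) > 1:
                stack.pop()        # closes a real embedding
    return deepest, ignored

# Hypothetical generated text: 32 subqueries, each opening two embeddings
# (one RLE, one LRE), all closed at the end.
nested = (RLE + LRE) * 32 + "blah" + PDF * 64

print(max_embedding_level(nested, 125))  # every opener is honored
print(max_embedding_level(nested, 61))   # the innermost openers are ignored
```

With alternating directions each opener adds one level, so 64 openers need
level 64. If every embedding had the same direction, each opener would add
two levels and the limit would be reached with about half as many.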

With *big* database applications, where installations may have thousands
of tables, with thousands of partitions, and multiple terabytes of data, automated
generation of very large and complicated SQL queries is common. And
while the database itself doesn't care about UBA or display order when
parsing and compiling such queries, the SQL text can be and *is*
routinely logged. The UTC's worry is that such logged, generated text
may include encapsulated embedded chunks, and you don't want the UBA
itself to introduce limits that cause failures when there is a use case
for displaying that text, for diagnostics, for example. I don't happen
to *know* of a particular example of such
text to point you to, but that kind of thing is the relevant use scenario.

--Ken

Received on Wed Oct 22 2014 - 14:20:14 CDT
