Discussion: [MarkLogic Dev General] Efficient iterating over values from a large lexicon
Steve Mallen
2011-06-02 10:32:11 UTC
Hi all,

I'm having problems processing a large lexicon of values and wondered if
anyone had done something similar or had any ideas of how best to deal
with them.

Basically, I've got a set of several million distinct values, and I want
to precompute a bunch of statistics for each of them (so that I can then
facet/sort values on the computed statistic). So, my plan is to fetch
all the values from the lexicon (storing them in a temp file, say), and
then run an XQuery on each value and store the resulting information in
a document (i.e. one stat document per value). I cannot do this in a
single query as it would take far too long to iterate over all values
and for all the computations and inserts.

But I can't seem to figure out the best way of fetching and iterating
over a Lexicon in MarkLogic (to pre-fetch the full set of lexicon
values). In SQL, I'd use a CURSOR to fetch the values one by one, and
then close the cursor at the end. There doesn't seem to be an analogous
concept in XQuery or XCC. I've tried something along the following lines:

(cts:element-values(xs:QName("lexi")))[$start to $end]

and fetching the values in blocks until I run out of values, but I'm
worried that this isn't very efficient, and I've got this nagging doubt
that the above will never return the empty sequence when $start is past
the end of the values. I'm not even sure how I should get a count of
the number of distinct values (xdmp:estimate doesn't work on the result
of cts:element-values()).

So - do you guys know of a way of efficiently iterating over a large set
of lexicon values without timing out the query on the server?

If I'm missing an obvious solution, please let me know...

-Steve
Michael Sokolov
2011-06-02 11:55:12 UTC
Steve - there is a "limit=nnn" option to those lexicon functions that
should be the fastest thing, even if the predicate isn't optimized.
Also, the second argument allows you to specify a start position *by value*.

So:

let $values := cts:element-values(xs:QName("lexi"), "", "limit=1000")
let $last := $values[1000]

say, followed by

cts:element-values(xs:QName("lexi"), $last, "limit=1000")

I guess you'd get some overlap between the first and last values of
subsequent iterations, but this shouldn't slow down as you progress
through the list.
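
Putting that together, an untested sketch of the whole loop ("lexi" is
a placeholder element name):

declare function local:page($start as xs:anyAtomicType?)
{
  let $values := cts:element-values(
    xs:QName("lexi"), $start, "limit=1000")
  (: after the first page, drop the first value to skip the overlap :)
  let $batch :=
    if (empty($start)) then $values
    else $values[position() gt 1]
  return
    if (empty($batch)) then ()
    else (
      (: process $batch here ... :)
      local:page($batch[last()])
    )
};
local:page(())

One caveat: at thousands of pages you may hit recursion limits, so you
might prefer to drive the loop from the client instead.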

-Mike

McBeath, Darin W (ELS-STL)
2011-06-02 12:29:17 UTC
I would also suggest you use the task server: break the big job into smaller jobs (processing, say, 1000 values in each go) and then spawn these tasks on the task server. I do this fairly often and it has worked well for me.
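
The spawning side is simple enough; an untested sketch (the module path
and batch count are placeholders):

let $batch-size := 1000
for $i in 0 to 1999  (: e.g. ~2 million values / 1000 per task :)
return xdmp:spawn(
  "/tasks/process-batch.xqy",
  (xs:QName("start"), $i * $batch-size + 1,
   xs:QName("size"), $batch-size))

Each spawned task then fetches and processes its own slice of the values.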

Darin.


Damon Feldman
2011-06-02 13:27:04 UTC
Steve,

xdmp:spawn() with a high task server queue size will work fine. You could also use CORB, which is a Java utility.

As for your existing approach, cts:element-values(...)[$start to $end] will work fine and return an empty sequence past the end of the values, and it will be optimized. To get the total number you can count the values; since this is a lexicon-only function that returns from the indexes without much overhead, no estimate is necessary.
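
For example (element name is a placeholder):

fn:count(cts:element-values(xs:QName("lexi")))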

Yours,
Damon
Steve Mallen
2011-06-02 13:44:00 UTC
Thanks Damon,

Good to know that the sequence slice method will be optimised. I think
I will do it that way to start with and see how it goes.

I'm not sure of the advantages of using xdmp:spawn(), though - I've
never used it before. Since I will be creating documents as I go (one
per lexicon value), is this something I should beware of? The docs say:

"use care or preferably avoid calling xdmp:spawn from a module that
is performing an update transaction."

I was thinking of just having a controlling Java process which passed
the start and end values to the (update) query, incrementing the values
for each invocation. I would return the number of values processed from
the query, and stop sending queries once I received an empty sequence.
Does that sound reasonable?
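
For reference, the update query I have in mind looks roughly like this
(untested; element names and the URI scheme are placeholders):

declare variable $start as xs:integer external;
declare variable $end as xs:integer external;

let $values := cts:element-values(xs:QName("lexi"))[$start to $end]
return (
  for $v in $values
  return xdmp:document-insert(
    fn:concat("/stats/", xdmp:url-encode(fn:string($v)), ".xml"),
    <stat><value>{$v}</value><!-- computed stats here --></stat>),
  fn:count($values)  (: processed count, returned to the Java controller :)
)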

Many thanks also to all who responded for your suggestions.
-Steve

Steve Mallen
2011-06-02 14:37:48 UTC
Hi Damon,

Having run a couple of tests, it seems that doing [$start to $end] is in
fact much slower than using limit from a start value. Running:

cts:element-values( xs:QName('lexi'), ("Z"), "limit=1000" )

takes 4 seconds

while

cts:element-values( xs:QName('lexi') )[1700000 to 1701000]

takes 0.003 seconds.

So it seems that the second query is not optimised and in fact loads all
the values into memory first before doing the array slice.

Is this what you expect?

Cheers,
-Steve

Mike Sokolov
2011-06-02 15:24:04 UTC
Did you swap the two examples below? They seem to contradict your
assertions :)

Also - I'd want to be careful about accounting for caching/paging
behavior in the server. You can get wildly varying timing results on
the first and subsequent invocations of the same query (presumably the
index has been brought into memory then).

-Mike

On 06/02/2011 10:37 AM, Steve Mallen wrote:
> Hi Damon,
>
> Having run a couple of tests, it seems that doing [$start to $end] is in
> fact much slower than using limit from a start value. Running:
>
> cts:element-values( xs:QName('lexi'), ("Z"), "limit=1000" )
>
> takes 4 seconds
>
> while
>
> cts:element-values( xs:QName('lexi') )[1700000 to 1701000]
>
> takes 0.003 seconds.
>
> So it seems that the second query is not optimised and in fact loads all
> the values into memory first before doing the array slice.
>
> Is this what you expect?
>
> Cheers,
> -Steve
Steve Mallen
2011-06-02 15:31:16 UTC
Hi Mike,

You're absolutely correct. I did get them the wrong way round - my bad!

I did, however, try the array-slice query several times with the same
result. In fact I got the same 4-second response time querying for a
slice with just 2 values in it - so I'm fairly convinced that the
"limit=n" method is orders of magnitude faster. Also, if you query for
values near the beginning of the list, the results are returned much
faster than for values at the end of the list, which adds to my
suspicions about how it's implemented.

Cheers,
-Steve

Jason Hunter
2011-06-02 16:01:54 UTC
On Jun 2, 2011, at 8:31 AM, Steve Mallen wrote:

> Also, if you query for values near the beginning of the list, the
> results are returned much faster than for values at the end of the
> list, which adds to my suspicions about how it's implemented.

Yes, if you ask to start at the millionth [1000000] item the server is going to linearly scan the first 999,999 items to figure out which one is the millionth. If you start at a value such as "N" then the server can jump more directly to the right starting point (through binary search or lookup tables, depending on configuration).

-jh-
Mike Sokolov
2011-06-02 16:26:58 UTC
On 06/02/2011 12:01 PM, Jason Hunter wrote:
> Yes, if you ask to start at the millionth [1000000] item the server is going to linearly scan the first 999,999 items to figure out which one is the millionth. If you start at a value such as "N" then the server can jump more directly to the right starting point (through binary search or lookup tables, depending on configuration).
>
>
I'm curious what sort of configuration would affect that, Jason?

-Mike
Jason Hunter
2011-06-02 17:05:59 UTC
On Jun 2, 2011, at 9:26 AM, Mike Sokolov wrote:

> I'm curious what sort of configuration would affect that, Jason?

On a database configuration there's a "range index optimize" setting where you can pick:

facet-time (i.e. construct a lookup table)
memory-size (i.e. do binary search)

facet-time is much faster but needs a bit more memory. It's the default now.

memory-size is how things behaved in 4.1 and previous. No lookup table, use binary search.

-jh-
Mike Sokolov
2011-06-02 18:12:55 UTC
I see, thanks - I wasn't familiar with those options. And I'm assuming
the situation is similar for lexicons?

-Mike

Jason Hunter
2011-06-02 19:15:15 UTC
I've never confirmed with engineering, but I also assume so.

-jh-

Danny Sokolsky
2011-06-02 16:05:14 UTC
It makes sense for the subset from the beginning to be faster than one from the middle. The lexicon functions operate on range indexes, and the range index is basically a big sorted list that sits in memory.

Now there are 2 factors that I don't think have been mentioned in this thread that perhaps are worth a mention:

* the hardware on which you are running (amount of memory, speed of memory and machine, etc.)
* the number of forests in your database

Because these operations are happening in memory, fast memory can make a difference here. You might see different results on a laptop than on a server machine (and sometimes not in the direction you might guess). But it is possible to distribute the processing somewhat by breaking it into more forests, so there are fewer lexicon entries in a single forest. There are tradeoffs to having multiple forests, and you should be sure you have enough cores on your system for the forests (rule of thumb: 2 cores per forest).

Just more food for thought.

-Danny

Damon Feldman
2011-06-02 17:38:50 UTC
Steve,

Yes, that's how it should work. [1 to 100] is optimized like limit=100. However, [1000000 to 1000100] needs to do more work. Your use of "Z" to start at a particular value in the lexicon is accomplished by a binary search over the in-memory sorted list that is the lexicon, so it works fast.

Your examples are a bit apples-to-oranges because

cts:element-values( xs:QName('lexi'), ("Z"), "limit=1000" )

should really be compared to

cts:element-values( xs:QName('lexi'), ("Z"))[1 to 1000]

rather than [1700000 to 1701000].

For 1.7MM+ items, I'd definitely consider CORB, but let us know if the task server approach works. Either the task server or CORB will run multithreaded, which is a benefit if you have multiple cores.

The warning in the docs about xdmp:spawn() and updates is just to say that once spawned, you have a separate, asynchronous transaction that can't be rolled back, so you should be aware of that. The other issue is that if the server is shut down during the overall batch, the tasks will be lost.

Yours,
Damon


Michael Blakeley
2011-06-02 14:05:58 UTC
Steve, the suggestions related to spawn and limits are good, but you might want to back up and reconsider the problem. Naturally you know the details of your problem best, but it might be possible to do some or all of the work more efficiently using existing product features.

-- Mike

Nuno Job
2011-06-02 14:35:15 UTC
An example of what Michael said would be to use element values with the
frequency-order option and cts:frequency. You might be assuming we can't
do something that we are perfectly optimized to do.
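
An untested sketch ("color" is a placeholder element name):

for $v in cts:element-values(
  xs:QName("color"), (), ("frequency-order", "limit=10"))
return fn:concat($v, ": ", cts:frequency($v))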

If that's not the case, the recommendations on spawn and CORB make sense.
As for spawn with an update: just remember spawn won't be rolled back,
and that might be why the documentation says that. Ideally you should
have a query statement that spawns update statements.

Makes sense?

Nuno
Steve Mallen
2011-06-02 14:47:36 UTC
Hi Nuno,

Unfortunately, I don't think it's that simple. I have a set of
documents, each of which contains a single integer statistic (let's call
it "number of sales" for the sake of argument). This datum is not
computed by our system, but is provided by an external party and matched
to each document. Each document (item) also contains metadata about the
item, such as title, color, flavour, etc. These have been put into
lexicons for fast faceting of search results.

But we now have a new requirement whereby we want to know the *total*
number of sales for each facet, and show the top-ranking (highest total
sales) colors, titles, etc. So I effectively need to order by the sum
of all "number of sales" of all items matching a facet. As far as I
know there is no way to facet on a computed value. To add to the
problem, some of the lexicons have millions of distinct values.

Therefore the only solution I can think of is to iterate over all
distinct values and pre-compute these sums. I can then add a range
index on the computed value and order by this sum.
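
The stat documents would look something like this (untested sketch;
names and the URI scheme are placeholders):

let $color := "red"
let $total := 12345 (: the precomputed sum for this value :)
return xdmp:document-insert(
  fn:concat("/stats/color/", xdmp:url-encode($color), ".xml"),
  <facet-stat>
    <facet>color</facet>
    <value>{$color}</value>
    <total-sales>{$total}</total-sales>
  </facet-stat>)

with a range index on total-sales to support the ordering.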

Hope I have explained this clearly...

-Steve

Michael Blakeley
2011-06-02 15:23:14 UTC
Steve, that makes sense to me. In order to use a range index on that sum, you would have to put its value into the database in an XML element or element-attribute. You could do that. You could move that work to ingestion time, using a CPF pipeline. The pipeline would use triggers to update the relevant meta-documents or summary documents for whatever values are in each update. That would turn each summary document into a serialization point, but that might be ok depending on your performance requirements and the frequency of updates.

But if you stay on the path you are on now, I would recommend using xdmp:spawn. That will let your code take better advantage of parallel processing. I would structure this using at least two modules: one to process a single batch of N values, and one to spawn off M batches of N values each. Here's an outline: you could tweak the batch size and element-values call to suit your needs. You might need a larger batch to avoid recursion limits, for example.

(: list all values and recursively spawn batches :)
declare variable $SIZE := 100 ;
declare variable $MODULE-BATCH := "/tasks/batch.xqy" ; (: your batch module :)
declare variable $OPTIONS := () ; (: xdmp:spawn options, if any :)

declare function local:spawn(
  $values as xs:string* )
as empty-sequence()
{
  let $batch := subsequence($values, 1, $SIZE)
  let $rest := subsequence($values, 1 + $SIZE)
  let $log := xdmp:log(text {
    "batch count", count($batch),
    "rest", count($rest) },
    "info")
  let $spawn := if (empty($batch)) then () else xdmp:spawn(
    $MODULE-BATCH,
    (xs:QName("VALUES-JSON"), xdmp:to-json($batch)),
    $OPTIONS )
  where exists($rest)
  return local:spawn($rest)
};

local:spawn(cts:element-values(...))

(: process one batch :)
declare variable $VALUES-JSON as xs:string external ;

xdmp:log(text { "batch", $VALUES-JSON }, "info"),
for $v in xdmp:from-json($VALUES-JSON)
return () (: do stuff with $v :)

-- Mike

Kelly Stirman
2011-06-02 16:07:02 UTC
Hi Steve,

Fun problem. :)

We could effectively facet by a computed sum, but whether that is practical depends on the cardinality of your facets.

If you don't have too many colors, flavors, titles, etc., you can compute the sum of the sales value for each unique value, then order them for display purposes. cts:sum lets you do this for sums in the same way cts:frequency lets you do it for counts. Unfortunately, there's not an option to return sum instead of count in the Search API, but you could use the rest of the Search API to build your queries and perform other tasks.
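
An untested alternative is to derive the same total from the lexicons
themselves, assuming range indexes on both elements and one sales value
per document (names are placeholders):

let $color := "red"
let $sales := cts:element-values(
  xs:QName("sales"), (), (),
  cts:element-value-query(xs:QName("color"), $color))
return fn:sum(
  for $s in $sales
  return $s * cts:frequency($s))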

Putting the sum in the database makes everything pretty easy if you don't think you need to manage updates to those sums.

Kelly
