Share your knowledge: Finding duplicate values with XQuery

Thursday, March 14, 2013

Finding duplicate values with XQuery

As I'm always interested in benchmarks and optimized solutions I compared 3 strategies of finding duplicate values in a big sequence. This test will use 5000 randomly generated numbers and compare performance.
The first thing I needed to do was creating a sequence of 5000 randomly generated numbers. Scala comes to the rescue

Loading ....

This generates a file with following content. Remark: in reality that sequence contains 5000 numbers.

let $values := (1012,5345,2891,3833,2854, 2236)

Now I ran the following XQueries on Zorba to get a feeling about how fast they are.

XQuery 1: (takes about 5 seconds for 5000 numbers)

let $values := (1012,5345,2891,3833,2854, 2236)
let $distinctvalues := distinct-values($values)
let $nonunique := for $value in $distinctvalues return if (count(index-of($values, $value)) > 1) then $value else ()
return $nonunique

XQuery 2: (takes about 5 seconds for 5000 numbers)

let $values := (1012,5345,2891,3833,2854, 2236)
return $values[index-of($values, .)[2]]

XQuery 3: (takes about 1 seconds for 5000 numbers)

let $values := (1012,5345,2891,3833,2854, 2236)

return distinct-values(for $value in $values
  return if (count($values[. eq $value]) > 1)
         then $value
         else ())

Of course I got intrigued how Sedna would perform. I only tried XQuery1 (3 to 4 seconds) and XQuery3 (around 13 seconds) on my local machine, which only shows that you cannot always take the same approach while trying to optimize.

7 comments:

Johan MörénMarch 14, 2013 at 10:41 AM
Interesting

Ran the same tests with BaseX 7.5 and got these numbers.

1: 207.88 ms
2: 224.35 ms
3: 656.39 ms

(random number generation not included in timing)
ReplyDelete
Replies
UnknownMarch 14, 2013 at 4:15 PM
Did you test with 5000 numbers or just the queries above?
ReplyDelete
Replies
UnknownMarch 15, 2013 at 3:22 AM
I just tried the 3 queries with zorba, on a lenovo T500 running linux 12.04.

The 3rd query took only 65ms. The other two were indeed much slower, taking around 4.3sec.

I used exactly the same scala program to generate the 5000 values. The queries return 898 values with duplicates.

Markos.

ReplyDelete
Replies
UnknownMarch 15, 2013 at 4:19 AM
Thx for confirming this Markos. But I am of course intrigued how Johan tested above queries on BaseX. I am aware that there is no such thing as saying... this will work faster in all cases with any implementation. But from algorithmic point of view one can of course use better strategies nonetheless.
ReplyDelete
Replies
Johan MörénMarch 15, 2013 at 5:25 AM
Published my script as a Gist here https://gist.github.com/hutchkintoot/5169467

Also added a comment displaying the optimizations that BaseX does for each query.

/Johan
ReplyDelete
Replies
chris wallaceMarch 19, 2013 at 10:55 AM
Good question - I added a new set to my Cookbook of XQuery patterns running on an eXist 2.0 server. Performance is very variable - the second example revealed a bug which will be fixed in the next release, and the third one is very slow. Wolfgang suggested using group and this indeed is blistering fast.

http://kitwallace.co.uk/Book/set/duplicate-numbers

Chris Wallace
ReplyDelete
Replies

Add comment