If you're interested in functional programming, you might also want to checkout my second blog which i'm actively working on!!

Thursday, March 14, 2013

Finding duplicate values with XQuery

As I'm always interested in benchmarks and optimized solutions I compared 3 strategies of finding duplicate values in a big sequence. This test will use 5000 randomly generated numbers and compare performance.
The first thing I needed to do was creating a sequence of 5000 randomly generated numbers. Scala comes to the rescue
Loading ....

This generates a file with following content. Remark: in reality that sequence contains 5000 numbers.
let $values := (1012,5345,2891,3833,2854, 2236)

Now I ran the following XQueries on Zorba to get a feeling about how fast they are.

XQuery 1: (takes about 5 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)
let $distinctvalues := distinct-values($values)
let $nonunique := for $value in $distinctvalues return if (count(index-of($values, $value)) > 1) then $value else ()
return $nonunique

XQuery 2: (takes about 5 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)
return $values[index-of($values, .)[2]]

XQuery 3: (takes about 1 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)

return distinct-values(for $value in $values
  return if (count($values[. eq $value]) > 1)
         then $value
         else ())

Of course I got intrigued how Sedna would perform. I only tried XQuery1 (3 to 4 seconds) and XQuery3 (around 13 seconds) on my local machine, which only shows that you cannot always take the same approach while trying to optimize.

7 comments:

  1. Interesting

    Ran the same tests with BaseX 7.5 and got these numbers.

    1: 207.88 ms
    2: 224.35 ms
    3: 656.39 ms

    (random number generation not included in timing)

    ReplyDelete
  2. Did you test with 5000 numbers or just the queries above?

    ReplyDelete
  3. I just tried the 3 queries with zorba, on a lenovo T500 running linux 12.04.

    The 3rd query took only 65ms. The other two were indeed much slower, taking around 4.3sec.

    I used exactly the same scala program to generate the 5000 values. The queries return 898 values with duplicates.

    Markos.

    ReplyDelete
  4. Thx for confirming this Markos. But I am of course intrigued how Johan tested above queries on BaseX. I am aware that there is no such thing as saying... this will work faster in all cases with any implementation. But from algorithmic point of view one can of course use better strategies nonetheless.

    ReplyDelete
  5. Published my script as a Gist here https://gist.github.com/hutchkintoot/5169467

    Also added a comment displaying the optimizations that BaseX does for each query.

    /Johan

    ReplyDelete
  6. Good question - I added a new set to my Cookbook of XQuery patterns running on an eXist 2.0 server. Performance is very variable - the second example revealed a bug which will be fixed in the next release, and the third one is very slow. Wolfgang suggested using group and this indeed is blistering fast.

    http://kitwallace.co.uk/Book/set/duplicate-numbers

    Chris Wallace

    ReplyDelete
    Replies
    1. Nice to see simple tests like these can produce some nice side effects :)

      Delete