Thursday, March 14, 2013

Finding duplicate values with XQuery

As I'm always interested in benchmarks and optimized solutions I compared 3 strategies of finding duplicate values in a big sequence. This test will use 5000 randomly generated numbers and compare performance.
The first thing I needed to do was creating a sequence of 5000 randomly generated numbers. Scala comes to the rescue
Loading ....

This generates a file with following content. Remark: in reality that sequence contains 5000 numbers.
let $values := (1012,5345,2891,3833,2854, 2236)

Now I ran the following XQueries on Zorba to get a feeling about how fast they are.

XQuery 1: (takes about 5 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)
let $distinctvalues := distinct-values($values)
let $nonunique := for $value in $distinctvalues return if (count(index-of($values, $value)) > 1) then $value else ()
return $nonunique

XQuery 2: (takes about 5 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)
return $values[index-of($values, .)[2]]

XQuery 3: (takes about 1 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)

return distinct-values(for $value in $values
  return if (count($values[. eq $value]) > 1)
         then $value
         else ())

Of course I got intrigued how Sedna would perform. I only tried XQuery1 (3 to 4 seconds) and XQuery3 (around 13 seconds) on my local machine, which only shows that you cannot always take the same approach while trying to optimize.