If you're interested in functional programming, you might also want to checkout my second blog which i'm actively working on!!

Thursday, March 14, 2013

Finding duplicate values with XQuery

As I'm always interested in benchmarks and optimized solutions I compared 3 strategies of finding duplicate values in a big sequence. This test will use 5000 randomly generated numbers and compare performance.
The first thing I needed to do was creating a sequence of 5000 randomly generated numbers. Scala comes to the rescue
Loading ....

This generates a file with following content. Remark: in reality that sequence contains 5000 numbers.
let $values := (1012,5345,2891,3833,2854, 2236)

Now I ran the following XQueries on Zorba to get a feeling about how fast they are.

XQuery 1: (takes about 5 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)
let $distinctvalues := distinct-values($values)
let $nonunique := for $value in $distinctvalues return if (count(index-of($values, $value)) > 1) then $value else ()
return $nonunique

XQuery 2: (takes about 5 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)
return $values[index-of($values, .)[2]]

XQuery 3: (takes about 1 seconds for 5000 numbers)
let $values := (1012,5345,2891,3833,2854, 2236)

return distinct-values(for $value in $values
  return if (count($values[. eq $value]) > 1)
         then $value
         else ())

Of course I got intrigued how Sedna would perform. I only tried XQuery1 (3 to 4 seconds) and XQuery3 (around 13 seconds) on my local machine, which only shows that you cannot always take the same approach while trying to optimize.