algorithm - Using vectors for performance improvement in Haskell

Question

Welcome To Ask or Share your Answers For Others

algorithm - Using vectors for performance improvement in Haskell

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

algorithm - Using vectors for performance improvement in Haskell

I'm very new to Haskell, and I have a question about what performance improvements can be had by using impure (mutable) data structures. I'm trying to piece together a few different things I've heard, so please bear with me if my terminology is not entirely correct, or if there are some small errors.

To make this concrete, consider the quicksort algorithm (taken from the Haskell wiki).

quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser  = filter (< p) xs
        greater = filter (>= p) xs

This is not "true quicksort." A "true" quicksort algorithm is in-place, and this is not. This is very memory inefficient.

On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort. An example is given in this stackoverflow answer.

How much faster is the second algorithm than the first? Big O notation doesn't help here, because the performance improvement is going to be from using memory more efficiently, not having a better algorithm (right?). I tired to construct some test cases on my own, but I had difficult getting things running.

An ideal answer would give some idea of what makes the in-place Haskell algorithm faster theoretically, and an example comparison of running times on some test data set.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:30:30+0000

On the other hand, it is possible to use vectors in Haskell to implement an in-place quicksort.

How much faster is the second algorithm than the first?

That depends on the implementation, of course. As can be seen below, for not too short lists, a decent in-place sort on a mutable vector or array is much faster than sorting lists, even if the time for the transformation from and to lists is included (and that conversion makes up the bulk of the time).

However, the list algorithms produce incremental output, while the array/vector algorithms don't produce any result before they have completed, therefore sorting lists can still be preferable.

I don't know exactly what the linked mutable array/vector algorithms did wrong. But they did something quite wrong.

For the mutable vector code, it seems that it used boxed vectors, and it was polymorphic, both can have significant performance impact, though the polymorphism shouldn't matter if the functions are {-# INLINABLE #-}.

For the IOUArray code, well, it looks fun, but slow. It uses an IORef, readArray and writeArray and has no obvious strictness. The abysmal times it takes aren't too surprising, then.

Using a more direct translation of the (monomorphic) C code using an STUArray, with a wrapper to make it work on lists1,

{-# LANGUAGE BangPatterns #-}
module STUQuickSort (stuquick) where

import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST

stuquick :: [Int] -> [Int]
stuquick [] = []
stuquick xs = runST (do
    let !len = length xs
    arr <- newListArray (0,len-1) xs
    myqsort arr 0 (len-1)
    -- Can't use getElems for large arrays, that overflows the stack, wth?
    let pick acc i
            | i < 0     = return acc
            | otherwise = do
                !v <- unsafeRead arr i
                pick (v:acc) (i-1)
    pick [] (len-1))

myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
   | lo < hi   = do
       let lscan p h i
               | i < h = do
                   v <- unsafeRead a i
                   if p < v then return i else lscan p h (i+1)
               | otherwise = return i
           rscan p l i
               | l < i = do
                   v <- unsafeRead a i
                   if v < p then return i else rscan p l (i-1)
               | otherwise = return i
           swap i j = do
               v <- unsafeRead a i
               unsafeRead a j >>= unsafeWrite a i
               unsafeWrite a j v
           sloop p l h
               | l < h = do
                   l1 <- lscan p h l
                   h1 <- rscan p l1 h
                   if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
               | otherwise = return l
       piv <- unsafeRead a hi
       i <- sloop piv lo hi
       swap i hi
       myqsort a lo (i-1)
       myqsort a (i+1) hi
   | otherwise = return ()

and a wrapper around a good sort (Introsort, not quicksort) on unboxed vectors,

module VSort where

import Data.Vector.Algorithms.Intro
import qualified Data.Vector.Unboxed as U
import Control.Monad.ST

vsort :: [Int] -> [Int]
vsort xs = runST (do
    v <- U.unsafeThaw $ U.fromList xs
    sort v
    s <- U.unsafeFreeze v
    return $ U.toList s)

I get times more in line with the expectations (Note: For these timings, the random list has been deepseqed before calling the sorting algorithm. Without that, the conversion to an STUArray would be much slower, since it would first evaluate a long list of thunks to determine the length. The fromList conversion of the vector package doesn't suffer from this problem. Moving the deepseq to the conversion to STUArray, the other sorting [and conversion, in the vector case] algorithms take a little less time, so the difference between vector-algorithms' introsort and the STUArray quicksort becomes a little larger.):

list size: 200000                    -O2     -fllvm  -fllvm-O2
────────             ────────   ────────   ────────   ────────
Data.List.sort      0.663501s  0.665482s  0.652461s  0.792005s
Naive.quicksort     0.587091s  0.577796s  0.585754s  0.667573s
STUArray.quicksort   1.58023s  0.142626s  1.597479s  0.156411s
VSort.vsort         0.820639s  0.139967s  0.888566s  0.143918s

The times without optimisation are expectedly bad for the STUArray. unsafeRead and unsafeWrite must be inlined to be fast. If not inlined, you get a dictionary lookup for each call. Thus for the large dataset, I omit the unoptimised ways:

list size: 3000000         -O2   -fllvm-O2
────────              ────────    ────────
Data.List.sort      16.728576s  16.442377s
Naive.quicksort     14.297534s  12.253071s
STUArray.quicksort   2.307203s   2.200807s
VSort.vsort          2.069749s   1.921943s

You can see that an inplace sort on a mutable unboxed array is much faster than a list-based sort if done correctly. Whether the difference between the STUArray sort and the sort on the unboxed mutable vector is due to the different algorithm or whether vectors are indeed faster here, I don't know. Since I've never observed vectors to be faster2 than STUArrays, I tend to believe the former. The difference between the STUArray quicksort and the introsort is in part due to the better conversion from and to lists that the vector package offers, in part due to the different algorithms.

At Louis Wasserman's suggestion, I have run a quick benchmark using the other sorting algorithms from the vector-algorithms package, using a not-too-large dataset. The results aren't surprising, the good general-purpose algorithms heapsort, introsort and mergesort all do well, times near the quicksort on the unboxed mutable array (but of course, the quicksort would degrade to quadratic behaviour on almost sorted input, while these are guaranteed O(n*log n) worst case). The special-purpose sorting algorithms AmericanFlag and radix sort do badly, since the input doesn't fit well to their purpose (radix sort would do better on larger inputs with a larger range, as is, it does too many more passes than needed for the data). Insertion sort is by far the worst, due to its quadratic behaviour.

AmericanFlag:
list size: 300000         -O2  -fllvm-O2
────────             ────────   ────────
Data.List.sort      1.083845s  1.084699s
Naive.quicksort     0.981276s   1.05532s
STUArray.quicksort  0.218407s  0.215564s
VSort.vsort         2.566838s  2.618817s

Heap:
list size: 300000         -O2  -fllvm-O2
────────             ────────   ────────
Data.List.sort      1.084252s   1.07894s
Naive.quicksort     0.915984s  0.887354s
STUArray.quicksort  0.219786s  0.225748s
VSort.vsort         0.213507s   0.20152s

Insertion:
list size: 300000         -O2   -fllvm-O2
────────             ────────    ────────
Data.List.sort      1.168837s   1.066058s
Naive.quicksort     1.081806s   0.879439s
STUArray.quicksort  0.241958s   0.209631s
VSort.vsort         36.21295s  27.564993s

Intro:
list size: 300000         -O2  -fllvm-O2
────────             ────────   ────────
Data.List.sort       1.09189s  1.112415s
Naive.quicksort     0.891161s  0.989799s
STUArray.quicksort  0.236596s  0.227348s
VSort.vsort         0.221742s   0.20815s

Merge:
list size: 300000         -O2  -fllvm-O2
────────             ────────   ────────
Data.List.sort      1.087929s  1.074926s
Naive.quicksort     0.875477s  1.019984s
STUArray.quicksort  0.215551s  0.221301s
VSort.vsort         0.236661s  0.230287s

Radix:
list size: 300000         -O2  -fllvm-O2
────────             ────────   ────────
Data.List.sort      1.085658s  1.085726s
Naive.quicksort     1.002067s  0.900985s
STUArray.quicksort  0.217371s  0.228973s
VSort.vsort         1.958216s  1.970619s

Conclusion: Unless you have a specific reason not to, using one of the good general-purpose sorting algorithms from vector-algorithms, with a wrapper to convert from and to lists if necessary, is the recommended way to sort large lists. (These algorithms also work well with boxed vectors, in my measurements approximately 50% slower than unboxed.) For short lists, the overhead of the conversion would be so large that it doesn't pay.

Now, at @applicative's suggestion, a look at the sorting times for vector-algorithms' introsort, a quicksort on unboxed vectors and an improved (shamelessly stealing the implementation of unstablePartition) quicksort on STUArrays.

The improved STUArray quicksort:

{-# LANGUAGE BangPatterns #-}
module NQuick (stuqsort) where


import Data.Array.Base (unsafeRead, unsafeWrite, getNumElements)
import Data.Array.ST
import Control.Monad.ST
import Control.Monad (when)

stuqsort :: STUArray s Int Int -> ST s ()
stuqsort arr = do
    n <- getNumElements arr
    when (n > 1) (myqsort arr 0 (n-1))

myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi = do
    p <- unsafeRead a hi
    j <- unstablePartition (< p) lo hi a
    h <- unsafeRead a j
    unsafeWrite a j p
    unsafeWrite a hi h
    when (j > lo+1) (myqsort a lo (j-1))
    when (j+1 < hi) (myqsort a (j+1) hi)

unstablePartition :: (Int -> Bool) -> Int -> Int -> STUArray s Int Int -> ST s Int
{-# INLINE unstablePartition #-}
unstablePartition f !lf !rg !v = from_left lf rg
  where
    from_left i j
      | i == j    = return i
      | otherwise = do
                      x <- unsafeRead v i
                      if f x
                        then from_left (i+1) j
                        else from_right i (j-1)

    from_right i j
      | i == j    = return i
      | otherwise = do
                      x <- unsafeRead v j
                      if f x
                        then do
                               y <- unsafeRead v i
                               unsafeWrite v i x
                               unsafeWrite v j y
                               from_left (i+1) j
                        else from_right i (j-1)

The vector quicksort:

module VectorQuick (vquicksort) where

import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Data.Vector.Generic.Mutable as GM
import Control.Monad.ST
import Control.Monad (when)

vquicksort :: UM.

Categories

algorithm - Using vectors for performance improvement in Haskell

algorithm - Using vectors for performance improvement in Haskell

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags