Code for Concinnity


In Search of Performance in Haskell

I was doing ITA Software’s “Word Numbers” problem. There are various solutions on the Web, and reviewing them tells us it’s a graph/grammar problem. Today we’re going to solve it by brute force.

Now let me give a quick recap of the problem:

A "word number" is defined as

wordNumber 1  = "one"
wordNumber 2  = "onetwo"
wordNumber 3  = "onetwothree"
...
wordNumber 12 = "onetwothreefourfivesixseveneightnineteneleventwelve"
...

Find the 51-billionth letter of wordNumber infinity. Assuming that letter falls within wordNumber x, also find the sum of 1 to x.

(As someone noted in the StackOverflow comments, I actually misinterpreted the original problem. But that doesn’t affect our purpose here.)
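To make the definition concrete, here is a minimal executable sketch (my own, not part of the original problem statement; the spell-out below is a toy that only covers 1 to 19, which is enough to reproduce the examples above):

```haskell
-- Toy spell-out covering only 1..19 -- just enough for the examples above.
spell :: Int -> String
spell n = names !! (n - 1)
  where
    names = words "one two three four five six seven eight nine ten \
                  \ eleven twelve thirteen fourteen fifteen sixteen \
                  \ seventeen eighteen nineteen"

-- A "word number" is the concatenation of the spelled-out numbers 1..n.
wordNumber :: Int -> String
wordNumber n = concatMap spell [1 .. n]

main :: IO ()
main = do
  putStrLn (wordNumber 3)   -- onetwothree
  putStrLn (wordNumber 12)  -- onetwothreefourfivesixseveneightnineteneleventwelve
```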

A naive algorithm in C++

The idea seems simple: keep 2 counters, one for the sum of the numbers and one for the total length. We go from wordNumber 1 upward until the length reaches 51000000000. Here is how I said it in C++:

OK, nothing surprising. It takes 45 seconds on my machine to execute (g++ -O3). We will use this as a baseline.
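For comparison with what follows, the same two-counter idea can be sketched in Haskell (my own minimal version, not the C++ listing; wordLen is taken as a parameter here because the real wordLength is only developed later in the post):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Walk i = 1, 2, 3, ... with two counters: the running sum of the
-- numbers and the running letter count. Stop once the letter count
-- reaches the target, returning (sum of 1..x, x).
solveNaive :: Int -> (Int -> Int) -> (Int, Int)
solveNaive target wordLen = go 0 0 1
  where
    go !sumNum !sumLen i
      | sumLen' >= target = (sumNum + i, i)
      | otherwise         = go (sumNum + i) sumLen' (i + 1)
      where
        sumLen' = sumLen + wordLen i

main :: IO ()
main = print (solveNaive 5 (const 1))  -- (15,5): five one-letter "words"
```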

First attempt in Haskell

OK, I lied. It was not my first attempt. My initial version had space leaks which caused the program to eat up 1 GB of RAM within seconds. Using profiling techniques from Real World Haskell, I identified the space leaks and sprinkled bang patterns everywhere.

So how well does that run?

Well, I don’t know.

As far as I remember, one version finished in 12 minutes, but I am not sure it was the version posted here. In short, the performance was unbearable.

Now if we were using Perl or Ruby, that might not have been too surprising. But for Haskell, it suggested something obvious was missing.

The solution

So there I went: I asked for help on StackOverflow and in Haskell-cafe.

There I got help from Bryan O’Sullivan, who posted his version. Reiner Pope said that with -fllvm, it finishes in about 1.5 minutes. (OK, he did not say exactly that. He said the Haskell version is half as fast as the C++ one.)

But it did not work for me

So I downloaded Bryan’s code (we’ll call it RealWordNumber.hs from now on; sorry, bad pun :) and off I went in excitement:

> ghc -O2 RealWordNumber.hs
> RealWordNumber
Sum: 1
Segmentation fault/access violation in generated code

Since RealWordNumber.hs uses Int, there was an integer overflow in his code. That is also exactly why I had to use Int64 in my original version.

(I was using Windows, where GHC only came in 32-bit.)
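To see why it overflowed: the 51-billion target simply does not fit in a 32-bit Int. A quick illustration of my own, using Int32 as a stand-in for a 32-bit GHC’s Int:

```haskell
import Data.Int (Int32, Int64)

main :: IO ()
main = do
  print (maxBound :: Int32)        -- 2147483647, i.e. about 2.1e9
  -- The 51-billion target is more than 20x larger than a 32-bit Int can hold:
  print (51000000000 `div` fromIntegral (maxBound :: Int32) :: Int64)  -- 23
  -- Fixed-width arithmetic silently wraps around on overflow:
  print ((maxBound :: Int32) + 1)  -- -2147483648
```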

So I patched his code to use Int64 and off I went. It never finished; I had to kill it.

Huh?

The culprit — 64-bit integers are slow in 32-bit programs

OK, the major reason turned out to be exactly that. To confirm, I compiled the C++ version as 32-bit using VC++ (cl /Ox WordNumber.cpp). It took 4 minutes to complete.

So I took RealWordNumber.hs to Arch Linux x64. It turned out that Int in GHC x64 is actually Int64, and the program completed in 5 minutes (ghc -O2).

Since the LLVM backend for GHC 7.0.3 seemed broken at the time of writing, and I did not want to go fixing it immediately, I trusted that it would really drop to 1.5 minutes with -fllvm.

Now the fun part — breaking down performance improvements

“So,” I figured, “I just needed to take my original version and put it on Arch x64.” So I tried that, and it still took longer than my patience allowed (8+ minutes). The following explains how I got my version into the same league as RealWordNumber.hs, step by step.

First of all, I tuned the target down to 510 million (510000000) to avoid waiting all day.

Here is a base line using 510 million as an input:

RealWordNumber -- 1.94 seconds
K2WordNumber   -- 5.41 seconds

From here we see that if I had waited about 15 minutes, I could have seen my original version finish :P

Now on to how I got it up to speed:

1. Change divMod to quotRem

I think I heard it in #haskell: people say that quotRem is faster than divMod. So I simply changed every divMod to quotRem. Look at what we got:

K2WordNumber-1 -- 4.48 seconds

OK, that was true: a surprisingly huge speed up.
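For reference, the two functions differ only in how they round when the operands are negative, and divMod’s extra sign-fixup is where the cost comes from. For the non-negative numbers in this program they agree:

```haskell
main :: IO ()
main = do
  print (7 `divMod` 2)      -- (3,1)
  print (7 `quotRem` 2)     -- (3,1): identical for non-negative operands
  print ((-7) `divMod` 2)   -- (-4,1): div rounds toward negative infinity
  print ((-7) `quotRem` 2)  -- (-3,-1): quot truncates toward zero
```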

2. Change Int64 to Int

Since GHC x64’s Int is actually 64-bit, I changed all Int64 to Int and removed the now-unnecessary fromIntegral calls.

K2WordNumber-2 -- 3.20 seconds

Two simple changes, and we are already getting there :P

3. Change quotRem to separate quot and rem

quotRem returns a tuple of boxed Ints, which may need to be constructed and then immediately taken apart. Max Bolingbroke suggested that using quot and rem separately might help. Here is the result:

K2WordNumber-3 -- 3.13 seconds

It might have been an improvement, but it might also have been experimental error. Either way, not much speed up was achieved.
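The change itself is mechanical. A hypothetical before/after (my own example, not the post’s actual code):

```haskell
-- Before: quotRem allocates a boxed pair that is immediately taken apart.
splitSum :: Int -> Int
splitSum n = let (q, r) = n `quotRem` 10 in q + r

-- After: two separate calls, no intermediate tuple.
splitSum' :: Int -> Int
splitSum' n = (n `quot` 10) + (n `rem` 10)

main :: IO ()
main = print (splitSum 123, splitSum' 123)  -- (15,15)
```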

4. Removed redundant parameters

It turned out that in my original version several redundant parameters were being passed recursively in the solve function (the acc label, and curr). I used ghc -Wall to track them down and removed them.

K2WordNumber-4 -- 3.16 seconds

No change. Since GHC figured out enough to warn me about them, it probably optimized them away for me anyway.

5. Simplified parameter passing

The original solve function looked like this:

solve :: Int64 -> (Int64, Int64, Int64) -> [Int64] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) (!num:nums)
    | sumLen' >= n = (sumNum', sumLen, num)
    | otherwise = solve n (sumNum', sumLen', num) nums
    where
        sumNum' = sumNum + num
        sumLen' = sumLen + wordLength num

To quote Bryan:

You’re passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.

He is right. I also noticed that the n parameter does not change and does not need to be passed on. Johan Tibell blogged that GHC could in theory create this closure for us. But I did it manually to see what happens:

solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
    where
        go !sumNum sumLen i
            | sumLen' >= n = (sumNum', sumLen, i)
            | otherwise = go sumNum' sumLen' (i+1)
            where
                sumNum' = sumNum + i
                sumLen' = sumLen + wordLength i

That resulted in

K2WordNumber-5 -- 2.87 seconds

Good.

6. Using unboxed integer types

I replaced quot and rem with these

    -- Requires {-# LANGUAGE MagicHash #-} and import GHC.Exts:
    (I# a) // (I# b) = I# (a `quotInt#` b)
    (I# a) % (I# b) = I# (a `remInt#` b)

K2WordNumber-6 -- 2.77 seconds

Not as much of a gain as I had expected.

7. Using Data.Vector.Unboxed instead of Data.Array.Unboxed

K2WordNumber-7 -- 2.50 seconds

8. Using Data.Vector.Generic.unsafeIndex instead of the (!) operator

K2WordNumber-8 -- 2.20 seconds

Wow, this one was huge.
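For reference, here is what steps 7 and 8 look like on a small stand-in table (illustrative, not the post’s exact code). (!) performs a bounds check on every access, while unsafeIndex skips it, so unsafeIndex is only safe when the index is known to be in range:

```haskell
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Generic as VG

-- Stand-in for one of the length tables.
lenOnes :: VU.Vector Int
lenOnes = VU.fromList [0,3,3,5,4,4,3,5,5,4]

checked, unchecked :: Int -> Int
checked i   = lenOnes VU.! i              -- bounds check on every access
unchecked i = lenOnes `VG.unsafeIndex` i  -- no check; i must be in [0,9]

main :: IO ()
main = print (checked 1, unchecked 9)  -- (3,4)
```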

9. Move the length vectors inside wordLength’s closure

-- Requires {-# LANGUAGE MagicHash, BangPatterns #-}, import GHC.Exts,
-- plus qualified imports of Data.Vector.Unboxed (VU) and Data.Vector.Generic (VG).
wordLength :: Int -> Int
wordLength i = go 0 i
    where
        go !pad !n
            | n < 10         = lenOnes `VG.unsafeIndex` n + pad
            | n < 20         = lenTeens `VG.unsafeIndex` (n-10) + pad
            | n < 100        = go (lenTens `VG.unsafeIndex` (n//10) + pad) (n%10)
            | n < 1000       = go (go (7+pad) (n//100)) (n%100)
            | n < 1000000    = go (go (8+pad) (n//1000)) (n%1000)
            | otherwise      = go (go (7+pad) (n//1000000)) (n%1000000)
        (I# a) // (I# b) = I# (a `quotInt#` b)
        (I# a) % (I# b) = I# (a `remInt#` b)
        lenOnes = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one","two", ...
        lenTens = VU.fromList [0,3,6,6,5,5,5,7,6,6]
        lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten" 3

K2WordNumber-9a -- 2.20 seconds

No improvement at all… But if I add bangs before the vectors…

!lenOnes = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one","two", ...
!lenTens = VU.fromList [0,3,6,6,5,5,5,7,6,6]
!lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten" 3

K2WordNumber-9 -- 2.00 seconds

WOW!! That was a large speed up. It turns out the reason is related to strictness: wordLength is not strict in lenOnes, lenTens and lenTeens, so they might be garbage collected (?). Putting bangs on them allows them to stay alive. Since we use those tables very frequently, it is better for them to be around all the time.

Haskell does not allow putting bangs on top-level (“global”) variables, so I had missed these in my original bang-sprinkling exercise.

OK, so now we have transformed my original sloppy Haskell program into a “well-performing” Haskell program. Since it performs the same as Bryan O’Sullivan’s (casually) tuned version, I am going to assume it is probably as fast as it should be (short of resorting to serious hacks).

And here is the final version of the code:

Moral of the story

  • Haskell is a very fast functional language. But for tight loops like this, always consider giving plain old C a try -- it may just save you days of profiling.
  • 64-bit operations in 32-bit programs are slow (*)
  • quotRem is faster than divMod
  • Earlier I argued in #haskell that “if Integer vs Int matters for your performance, your program is probably written wrong”. Well, there are cases where it matters quite a lot. It turns out the conversions are quite costly. (Though I still believe people should use Integer by default.)
  • Use the worker-wrapper pattern liberally.
  • If you recurse, make sure every parameter you pass actually changes on each recursive call. (In most cases this helps; if you only recurse a few times, the closure’s overhead may outweigh the benefit.)
  • Don’t use a list or tuple just because infinite lists look smart.
  • Data.Vector is a better “general case” container than Data.Array
  • Try to reduce the scope of “global” variables. Put them in a closure so you can put a bang on them.

* This may actually be down to Windows’ “WOW64” (32-bit emulation layer). I have not tried to verify this on 32-bit Linux GHC.

Extra: the fastest indexing data structure — functions!?

I heard from #haskell that “pattern matching in Haskell is less than O(n)”. When I heard it, I didn’t really believe it: I had always thought that N pattern matches translate into N if-then-else tests in the general case. The person who said it mentioned something about Church encoding, but I did not see how it could be applied in the general case to avoid the N if-then-else tests.

Just for fun, I tried changing the length vectors into functions:

lenOnes 0 = 0
lenOnes 1 = 3
lenOnes 2 = 3
lenOnes 3 = 5
-- ...
lenTeens 9 = 8

K2WordNumber-10 -- 2.00 seconds

Yes, pattern matching is as fast as unsafeIndex. WOW! wren ng thornton gives a more thorough explanation of this one in the comments below. It turns out GHC is really darn smart and can generate fast code from such “primitive” definitions.

Published by kizzx2 on August 10th, 2011 at 12:21 am.