In Search of Performance in Haskell
I was doing ITA Software’s “Word Numbers” problem. There are various solutions on the Web, and reviewing them tells us it’s a graph/grammar problem. Today we’re going to solve it by brute force.
Now let me give a quick recap of the problem:
A "word number" is defined as:

```haskell
wordNumber 1  = "one"
wordNumber 2  = "onetwo"
wordNumber 3  = "onetwothree"
...
wordNumber 12 = "onetwothreefourfivesixseveneightnineteneleventwelve"
...
```
Find the 51-billion-th letter of wordNumber infinity. Assuming that letter is found in wordNumber x, also find the sum of 1 to x.
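The definition can be sketched naively in Haskell (a toy version of my own, not the article’s code; `word` here only knows the names up to twelve, just enough for the example above):

```haskell
-- Hypothetical sketch: spell out 1..n and concatenate.
word :: Int -> String
word n = names !! (n - 1)
  where
    names = [ "one", "two", "three", "four", "five", "six"
            , "seven", "eight", "nine", "ten", "eleven", "twelve" ]

wordNumber :: Int -> String
wordNumber n = concatMap word [1 .. n]

main :: IO ()
main = putStrLn (wordNumber 12)
```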
(As someone noted in the StackOverflow comments, I actually misinterpreted the original problem. But that doesn’t affect our purpose here.)
A naive algorithm in C++
The idea seems simple: keep two counters, one for the sum of numbers and one for the total length, and go from wordNumber 1 until the length reaches 51000000000. Here is how I said it in C++:
OK, nothing surprising. It takes 45 seconds on my machine to execute (g++ -O3). We will use this as a baseline.
First attempt in Haskell
OK, I lied. It was not my first attempt. My initial version had space leaks that made the program eat up 1 GB of RAM within seconds. Using the profiling techniques from Real World Haskell, I identified the space leaks and sprinkled bang patterns everywhere.
So how well does that run?
Well, I don’t know.
As far as I remember, one version finished in 12 minutes, but I am not sure it was the version posted here. In short, the performance was unbearable.
Now if we were using Perl or Ruby, that might not have been too surprising. But for Haskell, it seemed something obvious was missing.
The solution
So off I went and asked for help on StackOverflow and on Haskell-cafe.
There I got help from Bryan O’Sullivan, who posted his version. Reiner Pope said that with `-fllvm`, it finishes in about 1.5 minutes. (OK, he did not exactly say that. He said the Haskell version is half as fast as C++.)
But it did not work for me
So I downloaded Bryan’s code (we’ll call it `RealWordNumber.hs` from now on, sorry for the bad pun 🙂) and off I went in excitement:

```
ghc -O2 RealWordNumber.hs
RealWordNumber
Sum: 1
Segmentation fault/access violation in generated code
```
Since `RealWordNumber.hs` uses `Int`, there is an integer overflow in his code. That’s also exactly why I had to use `Int64` in my original version. (I was using Windows, where GHC is 32-bit only.)
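To see the overflow concretely: the 51-billion target cannot even be represented in a 32-bit machine integer, while a 64-bit one holds it easily. A quick sketch:

```haskell
import Data.Int (Int32, Int64)

-- 51,000,000,000 is far beyond the 32-bit range (about 2.1e9),
-- but fits comfortably in 64 bits.
fitsIn32, fitsIn64 :: Bool
fitsIn32 = 51000000000 <= toInteger (maxBound :: Int32)
fitsIn64 = 51000000000 <= toInteger (maxBound :: Int64)

main :: IO ()
main = print (fitsIn32, fitsIn64)
```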
So I patched his code to use `Int64` and off I went. It never finished; I had to kill it.
Huh?
The culprit — 64-bit integers are slow in 32-bit programs
OK, the major reason turned out to be exactly this. To confirm, I compiled the C++ version to 32-bit using VC++ (`cl /Ox WordNumber.cpp`). It took 4 minutes to complete.
So I took RealWordNumber.hs to Arch Linux x64. It turned out that `Int` in GHC x64 is actually 64-bit, and the program completed in 5 minutes (`ghc -O2`).
Since the LLVM backend for GHC 7.0.3 seemed broken at the time of writing, and I did not want to go fixing it immediately, I trusted that it would really drop to 1.5 minutes with `-fllvm`.
Now the fun part — breaking down performance improvements
“So,” I figured, “I just need to take my original version and put it on Arch x64.” So I tried that, and it still took longer than my patience allowed (8+ minutes). The following explains how I got my version into the same league as `RealWordNumber.hs`, step by step.
First of all, I tuned the number down to 510 million (510000000) to avoid having to wait all day doing this.
Here is a baseline using 510 million as the input:

```
RealWordNumber -- 1.94 seconds
K2WordNumber   -- 5.41 seconds
```
From here we see that if I had waited about 15 minutes, I could have seen my original version finish 😛
Now on to how I got it up to speed:
1. Change `divMod` to `quotRem`

I think I heard it on #haskell: people say that `quotRem` is faster than `divMod`. So I simply changed every `divMod` to `quotRem`. Look at what we’ve got:
```
K2WordNumber-1 -- 4.48 seconds
```
OK, that was true. A surprisingly huge speed-up.
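For the record, the two pairs only differ on negative operands: `quot`/`rem` truncate toward zero (what the hardware does), while `div`/`mod` round toward negative infinity, which costs extra sign fix-ups. Since every number in this program is non-negative, the swap is safe. A quick check:

```haskell
-- For non-negative operands the two pairs agree;
-- they diverge only when signs differ.
main :: IO ()
main = do
  print (7 `divMod` 2, 7 `quotRem` 2)        -- both (3,1)
  print ((-7) `divMod` 2, (-7) `quotRem` 2)  -- (-4,1) vs (-3,-1)
```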
2. Change `Int64` to `Int`

Since GHC x64’s `Int` is already 64-bit, I changed all `Int64` to `Int` and removed the now-unnecessary `fromIntegral` calls.
```
K2WordNumber-2 -- 3.20 seconds
```
Two simple changes, and we are already getting there 😛
3. Change `quotRem` to separate `quot` and `rem`

`quotRem` returns a tuple of boxed values, which may need to be constructed and deconstructed. Max Bolingbroke suggested that using `quot` and `rem` separately might help. Here’s the experimental result:
```
K2WordNumber-3 -- 3.13 seconds
```
It might have been an improvement, but it might also have just been experimental error. Unfortunately, it seems not much speed-up was achieved.
4. Removed redundant parameters

It turned out that in my original version, several redundant parameters were being passed recursively in the `solve` function (the `acc` label, and `curr`). I used `ghc -Wall` to track them down and removed them.
```
K2WordNumber-4 -- 3.16 seconds
```
No change. Since GHC figured out enough to warn me, it probably did the right thing and optimized them away anyway.
5. Simplified parameter passing
The original `solve` function looked like this:

```haskell
solve :: Int64 -> (Int64, Int64, Int64) -> [Int64] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) (!num:nums)
    | sumLen' >= n = (sumNum', sumLen, num)
    | otherwise    = solve n (sumNum', sumLen', num) nums
  where
    sumNum' = sumNum + num
    sumLen' = sumLen + wordLength num
```
To quote Bryan:
> You’re passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
He is right. I also noticed that the `n` parameter never changes, so it does not need to be passed along. Johan Tibell blogged that GHC could in theory create this closure for us, but I did it manually to see what happens:
```haskell
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
  where
    go !sumNum sumLen i
        | sumLen' >= n = (sumNum', sumLen, i)
        | otherwise    = go sumNum' sumLen' (i+1)
      where
        sumNum' = sumNum + i
        sumLen' = sumLen + wordLength i
```
That resulted in:

```
K2WordNumber-5 -- 2.87 seconds
```
Good.
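To make the shape concrete, here is a self-contained toy instance of the same worker/wrapper rewrite; `wordLength` is stubbed out as the decimal digit count (my stand-in, not the article’s real length table), so the numbers are illustrative only:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Stub: decimal digit count, standing in for the real word-length table.
wordLength :: Int -> Int
wordLength = length . show

-- The wrapper captures the constant n; the worker `go` only carries
-- the accumulators that actually change on each recursive call.
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
  where
    go !sumNum !sumLen i
        | sumLen' >= n = (sumNum', sumLen, i)
        | otherwise    = go sumNum' sumLen' (i + 1)
      where
        sumNum' = sumNum + i
        sumLen' = sumLen + wordLength i

main :: IO ()
main = print (solve 10)
```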
6. Using unboxed integer types
I replaced `quot` and `rem` with these:

```haskell
(I# a) // (I# b) = I# (a `quotInt#` b)
(I# a) %  (I# b) = I# (a `remInt#` b)
```
```
K2WordNumber-6 -- 2.77 seconds
```
Not as much of a gain as I had thought.
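Filled out with the pragma and imports it needs, that snippet becomes a runnable sketch:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), quotInt#, remInt#)

-- Unbox, divide on the raw machine ints, rebox.
(//), (%) :: Int -> Int -> Int
(I# a) // (I# b) = I# (a `quotInt#` b)
(I# a) %  (I# b) = I# (a `remInt#` b)

main :: IO ()
main = print (1234 // 10, 1234 % 10)
```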
7. Using `Data.Vector.Unboxed` instead of `Data.Array.Unboxed`

```
K2WordNumber-7 -- 2.50 seconds
```
8. Using `Data.Vector.Generic.unsafeIndex` instead of the `(!)` operator

```
K2WordNumber-8 -- 2.20 seconds
```
Wow, this one was huge.
9. Moving the length vectors inside `wordLength`’s closure

```haskell
wordLength :: Int -> Int
wordLength i = go 0 i
  where
    go !pad !n
        | n < 10      = lenOnes `VG.unsafeIndex` n + pad
        | n < 20      = lenTeens `VG.unsafeIndex` (n-10) + pad
        | n < 100     = go (lenTens `VG.unsafeIndex` (n//10) + pad) (n%10)
        | n < 1000    = go (go (7+pad) (n//100)) (n%100)
        | n < 1000000 = go (go (8+pad) (n//1000)) (n%1000)
        | otherwise   = go (go (7+pad) (n//1000000)) (n%1000000)
    (I# a) // (I# b) = I# (a `quotInt#` b)
    (I# a) %  (I# b) = I# (a `remInt#` b)
    lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one", "two", ...
    lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
    lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten", 3
```
```
K2WordNumber-9a -- 2.20 seconds
```
No improvement at all… But if I add bangs before the vectors…
```haskell
    !lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one", "two", ...
    !lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
    !lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten", 3
```
```
K2WordNumber-9 -- 2.00 seconds
```
WOW!! That was a large speed-up. It turns out the reason is related to strictness: `wordLength` is not strict in `lenOnes`, `lenTens` and `lenTeens`, so they might get garbage-collected (?). Putting bangs on them allows them to stay alive. Since we use those arrays very frequently, it is better for them to be around all the time.

Haskell does not allow bang patterns on top-level (“global”) bindings, which is why I missed these in my original bang-sprinkling exercise.
OK, so now we have transformed my original sloppy Haskell program into a “well-performing” Haskell program. Since it performs the same as Bryan O’Sullivan’s (casually) tuned version, I am going to assume it’s probably as fast as it should be (short of resorting to serious hacks).
And here is the final version of the code:
Moral of the story
- Haskell is a very fast functional language. But for tight loops like this, always consider giving plain old C a try; it may just save you days of profiling.
- 64-bit operations in 32-bit programs are slow. (*)
- `quotRem` is faster than `divMod`.
- Earlier I argued in #haskell that “if `Integer` vs `Int` matters for your performance, your program is probably written wrong”. Well, there are cases where it matters quite a lot; it turns out the conversions are quite costly. (Though I still believe people should use `Integer` by default.)
- Use the worker-wrapper pattern liberally.
- If you recurse, make sure every parameter actually changes in each recursive call. (In most cases this helps; if you only recurse a few times, the closure’s overhead may outweigh the gain.)
- Don’t use a list or tuple just because infinite lists look smart.
- `Data.Vector` is a better “general case” container than `Data.Array`.
- Try to reduce the scope of “global” variables. Put them in a closure so you can bang on them.
(*) This may actually be the fault of Windows’ “WOW64” (32-bit emulation layer). I have not tried to verify this with a 32-bit Linux GHC.
Extra: the fastest indexing data structure — functions!?
I heard on #haskell that “pattern matching in Haskell is less than O(n)”. When I heard it, I didn’t really believe it; I had always thought that N pattern matches translate into N `if-then-else` tests in the general case. He said something about Church encoding, but I did not see how that could apply in the general case to avoid the N `if-then-else` tests.
Just for fun, I tried changing the length vectors into functions:

```haskell
lenOnes 0 = 0
lenOnes 1 = 3
lenOnes 2 = 3
lenOnes 3 = 5
-- ...
lenTeens 9 = 8
```
```
K2WordNumber-10 -- 2.00 seconds
```
Yes, pattern matching is as fast as `unsafeIndex`. WOW! wren ng thornton gives a more thorough explanation in the comments below. It turns out GHC is really darn smart and can figure out fast code given such “primitive” definitions.
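For completeness, here is the `lenOnes` lookup written out in full as patterns (values copied from the vector version above); GHC compiles a dense literal match like this into efficient branch code rather than a linear scan:

```haskell
-- Letter counts of "", "one", "two", ..., "nine" as a pattern match.
lenOnes :: Int -> Int
lenOnes 0 = 0
lenOnes 1 = 3
lenOnes 2 = 3
lenOnes 3 = 5
lenOnes 4 = 4
lenOnes 5 = 4
lenOnes 6 = 3
lenOnes 7 = 5
lenOnes 8 = 5
lenOnes 9 = 4
lenOnes _ = error "lenOnes: out of range"

main :: IO ()
main = print (map lenOnes [0 .. 9])
```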