I was doing ITA Software’s “Word Numbers” problem. There are various solutions on the Web, and reviewing them tells us it’s a graph/grammar problem. Today we’re going to solve it brute force.
Now let me give a quick recap of the problem:
A "word number" is defined as
wordNumber 1  = "one"
wordNumber 2  = "onetwo"
wordNumber 3  = "onetwothree"
...
wordNumber 12 = "onetwothreefourfivesixseveneightnineteneleventwelve"
...
Find the 51-billion-th letter of wordNumber infinity. Assuming that letter falls within wordNumber x, also find the sum of 1 to x.
(As someone noted in the StackOverflow comments, I actually misinterpreted the original problem. But that doesn’t affect our purpose here.)
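To make the recap concrete, here is a toy Haskell sketch of the definition (names are mine; the word list only goes up to twelve, just enough to mirror the examples above):

```haskell
-- Toy model of the problem statement: wordNumber n concatenates the
-- English names of 1..n (only defined up to 12 here).
names :: [String]
names = [ "one", "two", "three", "four", "five", "six"
        , "seven", "eight", "nine", "ten", "eleven", "twelve" ]

wordNumber :: Int -> String
wordNumber n = concat (take n names)

main :: IO ()
main = putStrLn (wordNumber 12)
```

The real task, of course, is to find the 51-billionth letter of this string as n goes to infinity, without ever materializing it.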
A naive algorithm in C++
The idea seems simple: two counters, one for the sum of the numbers and one for the length. We go from wordNumber 1 until the length reaches 51000000000. Here is how I said it in C++:
OK, nothing surprising. It takes 45 seconds to execute on my machine (g++ -O3). We will use this as a baseline.
First attempt in Haskell
OK, I lied. It was not my first attempt. My initial version had space leaks that led to the program eating up 1 GB of RAM within seconds. Using the profiling techniques from Real World Haskell, I identified the space leaks and sprinkled bang patterns everywhere.
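The kind of leak involved can be sketched like this (illustrative only, not the original code): a lazy accumulator builds up a chain of unevaluated thunks, and a bang pattern forces it at every step instead.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Leaky: `acc` accumulates thunks ((0 + x1) + x2) + ... that are only
-- forced at the very end, so memory grows with the list length.
sumLeaky :: [Int] -> Int
sumLeaky = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (acc + x) xs

-- Strict: the bang forces `acc` on every iteration, keeping it a plain
-- evaluated Int and the loop in constant space.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

main :: IO ()
main = print (sumStrict [1 .. 1000000])
```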
So how well does that run?
Well, I don’t know.
As far as I remember, one version finished in 12 minutes, but I am not sure it was the version posted here. In short, the performance was not bearable.
Now, if we were using Perl or Ruby, that might not have been too surprising. But for Haskell, it seemed something obvious was missing.
There I got help from Bryan O’Sullivan, who posted his version. Reiner Pope said that with -fllvm, it finishes in about 1.5 minutes. (OK, he did not say exactly that. He said the Haskell version is half as fast as the C++ one.)
But it did not work for me
So I downloaded Bryan’s code (we’ll call it RealWordNumber.hs from now on; sorry, bad pun 🙂) and off I went in excitement:

ghc -O2 RealWordNumber.hs
RealWordNumber
Sum: 1
Segmentation fault/access violation in generated code
With a 32-bit Int, there is an integer overflow in his code. That’s also exactly the reason why I had to use Int64 in my original version. (I was using Windows, where GHC is 32-bit only.)
So I patched his code to use Int64 and off I went again. It never finished; I had to kill it.
The culprit — 64-bit integers are slow in 32-bit programs
OK, this turned out to be the major reason. To confirm, I compiled the C++ version as 32-bit using VC++ (cl /Ox WordNumber.cpp). It took 4 minutes to complete.
So I took RealWordNumber.hs to Arch Linux x64. It turns out that Int in GHC x64 is actually 64-bit, and the program completed in 5 minutes.
Since the LLVM backend for GHC 7.0.3 seemed broken at the time of writing and I did not want to go fixing it immediately, I trusted that it would really drop to 1.5 minutes if I ran it with -fllvm.
Now the fun part — breaking down performance improvements
“So,” I figured, “I just need to take my original version and put it on Arch x64.” So I tried that, and it still took longer than my patience allowed (8+ minutes). The following explains how I got my version into the same league as RealWordNumber.hs, step by step.
First of all, I tuned the number down to 510 million (510000000) to avoid having to wait all day doing this.
Here is a base line using 510 million as an input:
RealWordNumber -- 1.94 seconds
K2WordNumber   -- 5.41 seconds
From here we see that if I had waited about 15 minutes, I could have seen my original version finish 😛
Now on to how I got it up to speed:
1. Change divMod to quotRem

I think I heard it on #haskell: people say that quotRem is faster than divMod. So I simply changed all the divMod calls to quotRem. Look at what we’ve got:
K2WordNumber-1 -- 4.48 seconds
OK, that was true. A surprisingly huge speed-up.
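For positive operands the two agree; they differ only on negative numbers (divMod rounds toward negative infinity, quotRem truncates toward zero), and the truncating pair maps directly onto the machine’s division instruction, which is where the speed-up comes from. A quick check:

```haskell
main :: IO ()
main = do
  print (divMod  17 5)     -- (3,2)
  print (quotRem 17 5)     -- (3,2): identical for positive operands
  print (divMod  (-17) 5)  -- (-4,3): rounds toward negative infinity
  print (quotRem (-17) 5)  -- (-3,-2): truncates toward zero
```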
2. Change Int64 to Int

Since GHC x64’s Int is already 64-bit, I changed all the Int64s to Int and removed the now-unnecessary conversions:
K2WordNumber-2 -- 3.20 seconds
Two simple changes, and we are already getting there 😛
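The change is safe because on a 64-bit GHC, Int is already 64 bits wide, so Int64 (and the fromIntegral conversions it drags in) buys nothing. A quick way to verify this on your own machine:

```haskell
import Data.Int (Int64)

main :: IO ()
main = do
  -- On a 64-bit GHC, both print 9223372036854775807 (2^63 - 1)
  print (maxBound :: Int)
  print (maxBound :: Int64)
```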
3. Change quotRem to separate quot and rem
quotRem returns a tuple of boxed Int64 values, which may need to be constructed and deconstructed. Max Bolingbroke suggested that using quot and rem separately might help. Here’s the experimental result:
K2WordNumber-3 -- 3.13 seconds
It might have been an improvement, but it might also have just been experimental error. Unfortunately, it seems not much speed-up was achieved.
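The shape of the change: instead of destructuring quotRem’s tuple, call quot and rem separately so no tuple needs to be allocated at each step. A digit-sum loop shows the pattern (my example, not code from the post):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Peels off decimal digits with separate quot/rem calls; no (q, r)
-- tuple is built per iteration.
digitSum :: Int -> Int
digitSum = go 0
  where
    go !acc 0 = acc
    go !acc n = go (acc + n `rem` 10) (n `quot` 10)

main :: IO ()
main = print (digitSum 1234)  -- 10
```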
4. Removed redundant parameters
It turned out that in my original version, several redundant parameters were being passed recursively in the solve function (the acc label and curr). I used ghc -Wall to track them down and removed them.
K2WordNumber-4 -- 3.16 seconds
No change. Since GHC figured out enough to warn me, it probably did the right thing and optimized them away anyway.
5. Simplified parameter passing
The solve function looked like this:
solve :: Int64 -> (Int64, Int64, Int64) -> [Int64] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) (!num:nums)
  | sumLen' >= n = (sumNum', sumLen, num)
  | otherwise    = solve n (sumNum', sumLen', num) nums
  where
    sumNum' = sumNum + num
    sumLen' = sumLen + wordLength num
To quote Bryan:
You’re passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
He is right. I also noticed that the n parameter never changes and does not need to be passed along. Johan Tibell blogged that GHC can, in theory, create this closure for us. But I did it manually to see what happens:
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
  where
    go !sumNum sumLen i
      | sumLen' >= n = (sumNum', sumLen, i)
      | otherwise    = go sumNum' sumLen' (i+1)
      where
        sumNum' = sumNum + i
        sumLen' = sumLen + wordLength i
That resulted in
K2WordNumber-5 -- 2.87 seconds
6. Using unboxed integer types
I replaced quot and rem with these definitions (this needs the MagicHash extension and GHC.Exts):

{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), quotInt#, remInt#)

(I# a) // (I# b) = I# (a `quotInt#` b)
(I# a) %  (I# b) = I# (a `remInt#` b)
K2WordNumber-6 -- 2.77 seconds

Not as much of a speed-up as I had thought.
7. Use Data.Vector.Unboxed

I switched the length tables to Data.Vector.Unboxed:

K2WordNumber-7 -- 2.50 seconds
8. Use Data.Vector.Generic.unsafeIndex

Then I used Data.Vector.Generic.unsafeIndex instead of the bounds-checked (!) operator:

K2WordNumber-8 -- 2.20 seconds
Wow, this one was huge.
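The difference is just the bounds check: (VU.!) validates the index on every access, while Data.Vector.Generic.unsafeIndex trusts the caller. The guards in wordLength already guarantee the index is in range, which is what makes the unsafe version acceptable here. A small sketch:

```haskell
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Unboxed as VU

lenOnes :: VU.Vector Int
lenOnes = VU.fromList [0,3,3,5,4,4,3,5,5,4]

checked :: Int -> Int
checked n = lenOnes VU.! n                -- bounds-checked on every call

unchecked :: Int -> Int
unchecked n = lenOnes `VG.unsafeIndex` n  -- no check; caller must keep 0 <= n <= 9

main :: IO ()
main = print (checked 3, unchecked 3)  -- (5,5): "three" has five letters
```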
9. Move the length vectors inside wordLength
wordLength :: Int -> Int
wordLength i = go 0 i
  where
    go !pad !n
      | n < 10      = lenOnes `VG.unsafeIndex` n + pad
      | n < 20      = lenTeens `VG.unsafeIndex` (n-10) + pad
      | n < 100     = go (lenTens `VG.unsafeIndex` (n//10) + pad) (n%10)
      | n < 1000    = go (go (7+pad) (n//100)) (n%100)
      | n < 1000000 = go (go (8+pad) (n//1000)) (n%1000)
      | otherwise   = go (go (7+pad) (n//1000000)) (n%1000000)
    (I# a) // (I# b) = I# (a `quotInt#` b)
    (I# a) %  (I# b) = I# (a `remInt#` b)
    lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one", "two", ...
    lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
    lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten", 3
K2WordNumber-9a -- 2.20 seconds

No improvement at all… but if I add bangs before the vectors…
!lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one", "two", ...
!lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
!lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten", 3
K2WordNumber-9 -- 2.00 seconds

WOW!! That was a large speed-up. It turns out the reason is related to strictness: wordLength is not strict in lenOnes, lenTens, and lenTeens, so they might be garbage collected (?). Putting bangs on them allows them to stay alive. Since we use those arrays very frequently, it is better for them to be around all the time.
Haskell does not allow putting bangs on “global” (top-level) variables, so I missed these in my original bang-sprinkling exercise.
OK, so now we have transformed my original sloppy Haskell program into a “well-performing” one. Since it performs the same as Bryan O’Sullivan’s (casually) tuned version, I am going to assume it’s probably as fast as it should be (before maybe resorting to serious hacks).
And here is the final version of the code:
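The listing below is my assembly of the pieces shown in the steps above (a reconstruction, not necessarily byte-for-byte the version that was benchmarked):

```haskell
{-# LANGUAGE BangPatterns, MagicHash #-}

import GHC.Exts (Int (I#), quotInt#, remInt#)
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Unboxed as VU

-- Letter count of the English name of i (no spaces, no "and").
wordLength :: Int -> Int
wordLength i = go 0 i
  where
    go !pad !n
      | n < 10      = lenOnes `VG.unsafeIndex` n + pad
      | n < 20      = lenTeens `VG.unsafeIndex` (n - 10) + pad
      | n < 100     = go (lenTens `VG.unsafeIndex` (n // 10) + pad) (n % 10)
      | n < 1000    = go (go (7 + pad) (n // 100)) (n % 100)         -- "hundred"
      | n < 1000000 = go (go (8 + pad) (n // 1000)) (n % 1000)       -- "thousand"
      | otherwise   = go (go (7 + pad) (n // 1000000)) (n % 1000000) -- "million"
    (I# a) // (I# b) = I# (a `quotInt#` b)
    (I# a) %  (I# b) = I# (a `remInt#` b)
    !lenOnes  = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one", "two", ...
    !lenTens  = VU.fromList [0,3,6,6,5,5,5,7,6,6]
    !lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten", 3

-- Walks i = 1, 2, ... until the accumulated letter count reaches n;
-- returns (sum of 1..i, letter count before i, i).
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
  where
    go !sumNum !sumLen !i
      | sumLen' >= n = (sumNum', sumLen, i)
      | otherwise    = go sumNum' sumLen' (i + 1)
      where
        sumNum' = sumNum + i
        sumLen' = sumLen + wordLength i

main :: IO ()
main = print (solve 51000) -- pass 510000000 to reproduce the timings above
```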
Moral of the story
- Haskell is a very fast functional language. But for tight loops like this, always consider giving plain old C a try; it may just save you days of profiling.
- 64-bit operations in 32-bit programs are slow (*)
- quotRem is faster than divMod
- Earlier I argued in #haskell that “if Int matters for your performance, your program is probably written wrong”. Well, there are cases where it matters quite a lot. It turns out conversions are quite costly. (Though I still believe people should use…)
- Use the worker-wrapper pattern liberally.
- If you recurse, make sure every parameter actually changes in each recursive call. (In most cases this helps; if you only recurse a few times, the closure’s overhead may be larger than the gain.)
- Don’t use a list or tuple just because infinite lists look smart.
- Data.Vector is a better “general case” container than …
- Try to reduce the scope of “global” variables. Put them in a closure so you can bang on them.
(*) This may actually be the suckage of Windows’ “WOW64” (32-bit emulation layer). I have not tried to verify this on 32-bit Linux GHC.
Extra: the fastest indexing data structure — functions!?
I heard on #haskell that “pattern matching in Haskell is less than O(n)”. When I heard it, I didn’t really believe it. I had always thought that N pattern matches translate into N if-then-else tests in the general case. He said something about Church encoding, but I did not see how it could be applied in the general case to avoid N comparisons.
Just for fun, I tried changing the length vectors into functions:
lenOnes 0 = 0
lenOnes 1 = 3
lenOnes 2 = 3
lenOnes 3 = 5
-- ...
lenTeens 9 = 8
K2WordNumber-10 -- 2.00 seconds
Yes, pattern matching is as fast as unsafeIndex. WOW! wren nt thornton, in the comments below, gives a more thorough explanation of this one. It turns out GHC is really darn smart and can figure out fast code given “primitive” definitions.
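Completing the sketch for one of the tables (the values are the letter counts from the vectors shown earlier): GHC compiles a dense set of literal Int patterns into a single case with efficient dispatch rather than a linear chain of tests, which is presumably why this keeps up with unsafeIndex.

```haskell
-- Letter counts of "", "one" .. "nine", as plain pattern matches.
lenOnes :: Int -> Int
lenOnes 0 = 0
lenOnes 1 = 3  -- "one"
lenOnes 2 = 3  -- "two"
lenOnes 3 = 5  -- "three"
lenOnes 4 = 4  -- "four"
lenOnes 5 = 4  -- "five"
lenOnes 6 = 3  -- "six"
lenOnes 7 = 5  -- "seven"
lenOnes 8 = 5  -- "eight"
lenOnes 9 = 4  -- "nine"
lenOnes _ = error "lenOnes: out of range"

main :: IO ()
main = print (map lenOnes [0 .. 9])  -- [0,3,3,5,4,4,3,5,5,4]
```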