I was doing ITA Software’s “Word Numbers” problem. Now there are various solutions on the Web. Reviewing those solutions tell us it’s a graph/grammar problem. Today we’re going to solve it brute-force.
Now let me give a quick recap of the problem:
A "word number" is defined as
wordNumber 1 = "one"
wordNumber 2 = "onetwo"
wordNumber 3 = "onetwothree"
...
wordNumber 12 = "onetwothreefourfivesixseveneightnineteneleventwelve"
...
Find the 51-billion-th letter of the wordNumber Infinity. Assume that letter is found for wordNumber x, also find the sum of 1 to x
(As someone noted in the StackOverflow comments, I actually misinterpreted the original problem. But that doesn’t affect our purpose here.)
A naive algorithm in C++
The idea seems simple: 2 counters, one for sum of numbers and one for length. We go from wordNumber 1 until the length meets 51000000000. Here is how I said it in C++:
// Brute force solver for the Word Number problem
#include <iostream>
#include <string>
int length_ones[10] = {0,3,3,5,4,4,3,5,5,4}; // "", one, two, three, ...
int length_tens[10] = {0,3,6,6,5,5,5,7,6,6}; // "", ten, twenty, ...
int length_teens[10] = {3,6,6,8,8,7,7,9,8,8}; // ten, eleven, twelve, ...
const char * ones[] = {"", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"};
const char * tens[] = {"", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
const char * teens[] = {"ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteeen", "nineteen"};
std::string wordify(long n)
{
if(n < 10) return ones[n];
else if(n < 20) return teens[n-10];
else if(n < 100) return std::string(tens[n/10]) + ones[n%10];
else if(n < 1000) return std::string(ones[n/100]) + "hundred" + wordify(n%100);
else if(n < 1000000) return wordify(n/1000) + "thousand" + wordify(n%1000);
else return wordify(n/1000000) + "million" + wordify(n%1000000);
}
int word_length(long n)
{
if(n < 10) return length_ones[n];
else if(n < 20) return length_teens[n-10];
else if(n < 100) return length_tens[n/10] + length_ones[n%10];
else if(n < 1000) return length_ones[n/100] + 7 + word_length(n%100); // 7 for "hundred"
else if(n < 1000000) return word_length(n/1000) + 8 + word_length(n%1000);
else return word_length(n/1000000) + 7 + word_length(n%1000000);
}
int main()
{
long sumNumbers = 0;
long sumLength = 0;
const long target = 51000000000;
for(long i; i < 999999999; i++)
{
sumNumbers += i;
long newSumLength = word_length(i) + sumLength;
if(newSumLength >= target)
break;
sumLength = newSumLength;
}
std::cout << "Sum: " << sumNumbers << std::endl;
std::cout << "The letter is " << wordify(i)[target - sumLength - 1] << std::endl;
return 0;
}
OK, nothing surprising. It takes 45 seconds on my machine to execute (g++ -O3). We will use this as a baseline.
First attempt in Haskell
-- Original version
-- Can you spot how to improve this performance by 3 fold right away?
{-# LANGUAGE BangPatterns #-}
import Debug.Trace
import Data.Int
import Control.Monad
import Data.Array.Unboxed
ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
tens = ["", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
lenOnes, lenTens, lenTeens :: UArray Int64 Int64
lenOnes = listArray (0,9) $ [0,3,3,5,4,5,3,5,5,4] -- "", "one","two", ...
lenTens = listArray (0,9) $ [0,3,6,6,5,5,5,7,6,6]
lenTeens = listArray (0,9) $ [3,6,6,8,8,7,7,9,8,8] -- first element is "ten" 3
-- potentially cleaner version
-- but I feared Haskell might do surprising things behind my
-- back so I stuck with the above
-- lenBelowHundred = listArray (0,99) $ map (fromIntegral . length . wordify) [1..99]
-- wordify 123 = "onehundredtwentythree"
-- This is only used once in presenting the final result character
wordify :: Int64 -> String
wordify n
| n < 10 = ones !! fromIntegral n
| n < 20 = teens !! (fromIntegral n-10)
| n < 100 = splitterTen
| n < 1000 = splitter 100 "hundred"
| n < 1000000 = splitter 1000 "thousand"
| n < 1000000000 = splitter 1000000 "million"
where
splitterTen = let (t, x) = n `divMod` 10
in (tens !! fromIntegral t) ++ wordify x
splitter div suffix = let (t, x) = n `divMod` div
in (wordify t) ++ suffix ++ wordify x
-- Optimized version of length (wordify n)
-- Used in number crunching
wordLength n = wordLength' 0 n
-- Tail recursive version
wordLength' :: Int64 -> Int64 -> Int64
wordLength' !pad !n
| n < 10 = lenOnes ! n + pad
| n < 20 = lenTeens ! (n-10) + pad
| n < 100 = splitterTen
| n < 1000 = splitter 100 7
| n < 1000000 = splitter 1000 8
| otherwise = splitter 1000000 7
where
splitterTen = let !(!t, !x) = n `divMod` 10
in wordLength' (lenTens ! t + pad) x
splitter !d !suffix = let !(!t, !x) = n `divMod` d
in wordLength' (wordLength' (suffix+pad) t) x
-- Tail recursive
solve :: Int64 -> (Int64, Int64, Int64) -> [Int64] -> (Int64, Int64, Int64)
solve !n !acc@(!sumNum, !sumLen, !curr) (!num:nums)
| sumLen' >= n = (sumNum', sumLen, num)
| otherwise = solve n (sumNum', sumLen', num) nums
where
sumNum' = sumNum + num
sumLen' = sumLen + wordLength num
solution :: Int64 -> (Int64, Char)
solution !x =
let (sumNum, sumLen, n) = solve x (0,0,1) [1..]
in (sumNum, (wordify n) !! (fromIntegral $ x - sumLen - 1))
main = do
print $ solution 1234 -- Make sure we are sane
print $ solution 51000000000
OK, I lied. It was not my first attempt. My initial version had space leaks which led to the program eating up 1GB RAM in several seconds. Using profiling techniques from Real World Haskell, I identified the space leaks and sprinkled Bang patterns everywhere.
So how well does that run?
Well, I don’t know.
As far as I remember, one version finished in 12 minutes, but I am not sure it is the version posted here. In short, the performance was not bearable.
Now if we were using Perl or Ruby, that might not have been too surprising. But for Haskell this seems something obvious is missing.
The solution
So there I went, I asked for help on StackOverflow and in Haskell-cafe.
There I got help from Bryan O’Sullivan who posted his version. Reiner Pope said that with -fllvm, it finishes in about 1.5 minutes. (OK, he did not exactly say that. He said Haskell version is half as fast as C++.)
But it did not work for me
So I downloaded Bryan’s code (we’ll call it RealWordNumber.hs from now on, sorry bad pun
and off I went in excitement:
> ghc -O2 RealWordNumber.hs
> RealWordNumber
> Sum: 1
> Segmentation fault/access violation in generated code
Since RealWordNumber.hs uses Int, there is an integer overflow in his code. That’s also exactly the reason why I had to use Int64 in my original version.
(I was using Windows where GHC only has 32-bit)
So I patched his code to use Int64 and off I went. It never finished, I had to kill it.
Huh?
The culprit — 64-bit integers are slow in 32-bit programs
OK the major reason turned out to be this. To confirm, I compiled the C++ version to 32-bit using VC++ (cl /Ox WordNumber.cpp). It took 4 minutes to complete.
So I took RealWordNumber.hs to Arch Linux x64. It turned out that Int in GHC x64 is actually Int64, and the program completed in 5 minutes (ghc -O2).
Since the LLVM backend for GHC 7.0.3 seemed broken at the time I write this, and I did not want to go fixing it immediately, I trusted that it would really turn to 1.5 minutes if I ran it with -fllvm.
Now the fun part — breaking down performance improvements
“So,” I figured, “I just needed to take my original version and put it on Arch x64″. So I tried that, and it was still taking longer than my patience allowed (8+ minutes). The following will explain how I got my version up to the same league with RealWordNumber.hs, step by step.
First of all, I tuned the number down to 510 million (510000000) to avoid having to wait all day doing this.
Here is a base line using 510 million as an input:
RealWordNumber -- 1.94 seconds
K2WordNumber -- 5.41 seconds
From here we see that if I had waited about 15 minutes, I could have seen my original version finish
Now on to how I got it up to speed:
1. Change divMod to quotRem
I think I heard it from #haskell, people say that quotRem is faster than divMod. So I simply changed all divMod to quotRem. Look at what we’ve got:
K2WordNumber-1 -- 4.48 seconds
OK, that was true. A surprising huge speed up.
2. Change Int64 to Int
Since GHC x64′s Int is actually 64-bit, I changed all Int64 to Int and removed the unnecessary fromIntegral calls.
K2WordNumber-2 -- 3.20 seconds
2 simple changes, we are already getting there
3. Change quotRem to separate quot and rem
Since quotRem returns a tuple of boxed Int64, which may need to be constructed and reconstructed. Max Bolingbroke suggested that using quot and rem separately might help. Here’s the experiment result
K2WordNumber-3 -- 3.13 seconds
It might have been an improvement, but might have just been experimental error. Unfortunately it seems like not much speed up was achieved.
4. Removed redundant parameters
It turned out in my original version, there were several redundant parameters being passed recursively in the function solve (the acc label, and curr). I used ghc -Wall to track them and removed them.
K2WordNumber-4 -- 3.16 seconds
No change. Since GHC figured out enough to warn me, it probably did the right thing and optimized them for me anyway.
5. Simplified parameter passing
The original solve function looked like this
1 2 3 4 5 6 7
| solve :: Int64 -> (Int64 , Int64 , Int64 ) -> [Int64 ] -> (Int64 , Int64 , Int64 )
solve !n !acc @(!sumNum , !sumLen , !curr ) (!num:nums )
| sumLen' >= n = (sumNum' , sumLen , num )
| otherwise = solve n (sumNum' , sumLen' , num ) nums
where
sumNum' = sumNum + num
sumLen' = sumLen + wordLength num |
To quote Bryan:
You’re passing parameters in three different ways: a regular Int, a 3-tuple, and a list! Whoa.
He is right. I also noticed that the n parameter does not change and does not need to be passed on. Johan Tibell blogged that GHC could in theory create this closure for us. But I did it manually to see what happens:
1 2 3 4 5 6 7 8 9
| solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
where
go !sumNum sumLen i
| sumLen' >= n = (sumNum' , sumLen , i )
| otherwise = go sumNum' sumLen' (i +1)
where
sumNum' = sumNum + i
sumLen' = sumLen + wordLength i |
That resulted in
K2WordNumber-5 -- 2.87 seconds
Good.
6. Using unboxed integer types
I replaced quot and rem with these
1 2
| (I# a) // (I# b) = I# (a `quotInt#` b)
(I# a) % (I# b) = I# (a `remInt#` b) |
K2WordNumber-6 — 2.77 seconds
Not as much as I had thought
7. Using Data.Vector.Unboxed instead of Data.Array.Unboxed
K2WordNumber-7 -- 2.50 seconds
8. Using Data.Vector.Generic.unsafeIndex instead of the (!) operator
K2WordNumber-8 -- 2.20 seconds
Wow, this one was huge.
9. Move the length vectors inside wordLength‘s closure
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
| wordLength :: Int -> Int
wordLength i = go 0 i
where
go !pad !n
| n < 10 = lenOnes `VG .unsafeIndex` n + pad
| n < 20 = lenTeens `VG .unsafeIndex` (n -10) + pad
| n < 100 = go (lenTens `VG .unsafeIndex` (n //10) + pad ) (n %10 )
| n < 1000 = go (go (7+pad ) (n //100)) (n %100 )
| n < 1000000 = go (go (8+pad ) (n //1000)) (n %1000 )
| otherwise = go (go (7+pad ) (n //1000000)) (n %1000000 )
(I# a ) // (I# b ) = I# (a `quotInt#` b )
(I# a ) % (I# b ) = I# (a `remInt#` b )
lenOnes = VU .fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one","two", ...
lenTens = VU .fromList [0,3,6,6,5,5,5,7,6,6]
lenTeens = VU .fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten" 3 |
K2WordNumber-9a — 2.20 seconds
No improvement at all… But if I add bangs before the vectors…
1 2 3
| !lenOnes = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one","two", ...
!lenTens = VU.fromList [0,3,6,6,5,5,5,7,6,6]
!lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten" 3 |
K2WordNumber-9 — 2.00 seconds
WOW!! That was a large speed up. It turns out the reason for the speed up is related to strictness. wordLength is not strict in all lenOnes, lenTens and lenTeens, so they might be garbaged collected (?). Putting bangs on them allow them to stay alive. Since we need to use those arrays very frequently, it was better for them to be around all the time.
Haskell did not allow putting bangs on “global” variables, so I missed it in my original bang sprinkling exercise.
OK, so now we have transformed my original sloppy Haskell program to a “well-performing” Haskell program. Since it performs the same as Bryan O’Sullivan (casually) tuned version, I am going to assume it’s probably as fast as it should (before maybe resorting to serious hacks).
And here is the final version of the code:
-- Problem: Find the 51000000000-th character of the string (wordNumber Infinity)
-- where a wordNumber is defined as
--
-- wordNumber 1 = "one"
-- wordNumber 2 = "onetwo"
-- wordNumber 3 = "onetwothree"
-- wordNumber 15 = "onetwothreefourfivesixseveneightnineteneleventwelvethirteenfourteenfifteen"
-- ...
--
-- The answer should be presented as ( sum of all numbers up to that point
-- , the 51000000000-th character
-- )
-- This is a Haskell performance tuning exercise, trying to achieve C like performance.
-- Full guided story can be found at
-- http://cfc.kizzx2.com/index.php/in-search-of-performance-in-haskell/
--
-- (This Word Number problem is actually a misunderstood version of ITA's version)
{-# LANGUAGE BangPatterns, MagicHash #-}
import qualified Data.Vector.Unboxed as VU
import Data.Vector.Unboxed ((!))
import qualified Data.Vector.Generic as VG
import GHC.Base (Int(..), quotInt#, remInt#)
ones, tens, teens :: [String]
ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
tens = ["", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
wordify :: Int -> String
wordify n
| n < 10 = ones !! n
| n < 20 = teens !! (n-10)
| n < 100 = splitterTen
| n < 1000 = splitter 100 "hundred"
| n < 1000000 = splitter 1000 "thousand"
| otherwise = splitter 1000000 "million"
where
splitterTen = let (t, x) = n `quotRem` 10
in (tens !! t) ++ wordify x
splitter d suffix = let (t, x) = n `quotRem` d
in (wordify t) ++ suffix ++ wordify x
-- Tail recursive version
wordLength :: Int -> Int
wordLength i = go 0 i
where
go !pad !n
| n < 10 = lenOnes `VG.unsafeIndex` n + pad
| n < 20 = lenTeens `VG.unsafeIndex` (n-10) + pad
| n < 100 = go (lenTens `VG.unsafeIndex` (n//10) + pad) (n%10)
| n < 1000 = go (go (7+pad) (n//100)) (n%100)
| n < 1000000 = go (go (8+pad) (n//1000)) (n%1000)
| otherwise = go (go (7+pad) (n//1000000)) (n%1000000)
(I# a) // (I# b) = I# (a `quotInt#` b)
(I# a) % (I# b) = I# (a `remInt#` b)
!lenOnes = VU.fromList [0,3,3,5,4,4,3,5,5,4] -- "", "one","two", ...
!lenTens = VU.fromList [0,3,6,6,5,5,5,7,6,6]
!lenTeens = VU.fromList [3,6,6,8,8,7,7,9,8,8] -- first element is "ten" 3
solve :: Int -> (Int, Int, Int)
solve n = go 0 0 1
where
go !sumNum sumLen i
| sumLen' >= n = (sumNum', sumLen, i)
| otherwise = go sumNum' sumLen' (i+1)
where
sumNum' = sumNum + i
sumLen' = sumLen + wordLength i
solution :: Int -> (Int, Char)
solution x =
let (sumNum, sumLen, n) = solve x
in (sumNum, (wordify n) !! (x - sumLen - 1))
main :: IO ()
main = do
print $ solution 510000000
Moral of the story
- Haskell is a very fast functional language. But for tight areas like this, always consider giving plain old C a try — it may just save you days of profiling.
- 64-bit operations in 32-bit programs are slow (*)
quotRem is faster than divMod
- Earlier I argued in #haskell that “if
Integer vs Int matters for your performance, your program is probably written wrong”. Well, there are cases where it matters quite a lot. It turns out conversions are quite costly. (Though I still believe people should use Integer by default)
- Use the wrapper-worker pattern liberally.
- If you recurse, make sure every variable actually changes in each recursion. (For most cases this will benefit. If you only recurse several times maybe the closure’s overhead will be larger.)
- Don’t use a list or tuple just because infinite lists look smart.
Data.Vector is a better “general case” container than Data.Array
- Try to reduce the scope of “global” variables. Put them in a closure so you can bang on it.
* This may be actually the suckage of Windows’ “WOW64″ (32-bit emulation layer). I have not tried to prove this on 32-bit Linux GHC)
Extra: the fastest indexing data structure — functions!?
I heard from #haskell that “pattern matching in Haskell is less than O(n)”. When I heard it, I didn’t really believe it. I have always thought that N pattern matches means translating into N if-then-else statements for the general case. He said something related to Church encoding but I did not see how it could be applied in the general case to avoid N if-then-else statements.
Just for fun, I tried changing the length vectors into functions:
1 2 3 4 5 6
| lenOnes 0 = 0
lenOnes 1 = 3
lenOnes 2 = 3
lenOnes 3 = 5
-- ...
lenTeens 9 = 8 |
K2WordNumber-10 — 2.00 seconds
Yes, pattern matching is as fast as unsafeIndex. WOW! wren nt thornton from the comments below give a more thorough explanation on this one. It turns out GHC is really darn smart and can figure out fast code given “primitive” definitions.