Code for Concinnity

`ld: symbol(s) not found for architecture x86_64` when `pip install` on Mac

For example, you might get linker errors like this:

"_PyUnicode_CompareWithASCIIString", referenced from:
    _extract_opcodes in _levenshtein.o
    _extract_editops in _levenshtein.o
"_PyUnicode_FromUnicode", referenced from:
    _median_improve_py in _levenshtein.o
    _apply_edit_py in _levenshtein.o
    _median_common in _levenshtein.o
"_PyUnicode_InternFromString", referenced from:
    _PyInit_levenshtein in levenshtein.o
"_PyUnicode_Type", referenced from:
    _hamming_py in _levenshtein.o
    _jaro_py in _levenshtein.o
    _jaro_winkler_py in _levenshtein.o
    _median_improve_py in _levenshtein.o
    _editops_py in _levenshtein.o
    _opcodes_py in _levenshtein.o
    _apply_edit_py in _levenshtein.o
"_Py_NoneStruct", referenced from:
    _median_improve_py in _levenshtein.o
    _median_common in _levenshtein.o

And here’s how to solve it:

LDFLAGS='-undefined dynamic_lookup' pip install cx_Oracle # Or your awesome package
Published by kizzx2 on July 24th, 2017 at 1:58 pm.

`luarocks install alien` on OS X with Homebrew

If you just do `luarocks install alien` you may get this error:

checking for ffi_closure_alloc in -lffi... no
configure: error: in `/tmp/luarocks_alien-0.7.0-1-1080/alien-0.7.0':
configure: error: cannot find new enough libffi

The solution is to specify the lib directory with CFLAGS:

brew install libffi
CFLAGS='-L/usr/local/opt/libffi/lib' luarocks install alien
Published by kizzx2 on March 20th, 2015 at 10:50 am.

Heroku + Node.js: Cannot find module 'base64'

Some older modules still use the base64 package. Solution:

heroku config:set NODE_PATH=node_modules/base64/build/Release/

Published by kizzx2 on January 21st, 2015 at 9:35 am.

EMFILE when doing `grunt nodewebkit` — solution

ulimit -n 9999 && grunt nodewebkit

Published by kizzx2 on December 22nd, 2014 at 11:53 pm.

Run single test with Spring in Rails 4.1

In the old days, you would run a single test with:

$ ruby -I test test/functional/my_controller_test.rb -n test_my_case

This is all well and good, but painfully slow. Then brilliant people created Spork, Zeus and Spring to speed up that process.

With version 4.1, Rails adopted Spring as the official preloader, probably due to its pure Ruby nature. Unfortunately, it is the only one of the pack that does not seem to support running a single test.

After poking around I discovered a method that works with any preloader: put a script at script/test.rb:

# script/test.rb
require 'optparse'

$: << Rails.root.join('test').to_s

options = {}
OptionParser.new do |opts|
  opts.on('-n TEST_NAME') do |n|
    options[:test_name] = n
  end
  opts.on('-e ENVIRONMENT') do |e|
    raise "Must run in test environment" if e != 'test'
  end
end.parse!

# Whatever remains in ARGV is the list of test file globs
test_files = ARGV.dup
ARGV.clear
if options[:test_name]
  ARGV << "-n" << options[:test_name]
end

test_files.each do |f|
  Dir[f].each do |f1|
    load f1
  end
end

Usage is just like good old ruby -I test:

$ bin/rails r -e test script/test.rb test/controllers/my_controller_test.rb -n test_my_case
$ bin/rails r -e test script/test.rb test/controllers/**/*.rb

If that’s too long, you can just put something like `alias rt='bin/rails r -e test script/test.rb'` in your shell’s rc.

Published by kizzx2 on May 10th, 2014 at 1:16 pm.

Elegant Gradle + Android dependency management (even works for Ant-based projects!)

With Gradle, the Java world has finally started to catch up with modern dependency management. Maven, the technology, has always worked, but frankly, everybody who used it suffered.

The new Gradle support in Android’s build system is promising but unfortunately has a lot of rough edges. I had a lot of trouble getting Android Studio to work smoothly, let alone converting existing Ant-based projects.

However, you don’t need to convert your project if all you want is to download .jar files using Gradle. Just create build.gradle alongside your build.xml:
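Such a build.gradle can be as small as this sketch — the dependency coordinate is just a placeholder, and the libs task name matches the command below:

```groovy
// build.gradle -- only used to resolve and copy jars; Ant still does the build
apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    compile 'com.google.code.gson:gson:2.2.4' // placeholder -- use your own deps
}

// Copies every resolved jar into libs/, where Ant/Eclipse expect them
task libs(type: Copy) {
    from configurations.compile
    into 'libs'
}
```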

Now you can invoke a command like

gradle libs

to get the .jar files copied into the libs directory. The rest of your Ant/Eclipse workflow then just works.

Published by kizzx2 on March 2nd, 2014 at 12:26 pm.

Illustrative elaboration — Why are Cassandra secondary indexes recommended for low cardinality attributes?

At the time of writing (as Cassandra is evolving very fast), Cassandra’s documentation recommends using its built-in secondary index only for low cardinality attributes (i.e. attributes with a few unique values).

The reason isn’t immediately obvious and the documentation doesn’t explain it in detail. A quick Google search only yields this Jira ticket at the moment, which does in fact answer the question, but rather subtly.

This is an attempt to clarify it from my understanding.

Secondary indexes are stored locally

The main difference between the primary index and secondary indexes is distributed vs. local indexes, as mentioned in the above Jira ticket. Basically, that means every node in the Cassandra cluster can answer the question “Which node contains the row with primary key d397bb236b2d6c3b6bc6fe36893ec1ea?” immediately.

However, secondary indexes are stored locally: because they are implemented as column families, it is not guaranteed that an arbitrary node can answer the question “Which node contains the Person with state = 'us'?” immediately. To answer it, the node needs to go out and ask all nodes that question.

An example — low cardinality scenario

Suppose we build a secondary index on gender of Person in a 10-node cluster. If you use RandomPartitioner as recommended, the data is distributed uniformly with respect to gender across all nodes. That is, in normal cases every node should contain roughly 50% males and 50% females.

Now if I issue a query “give me 100 males”, no matter which node I connect to, that node will be able to answer the query without consulting other nodes (assuming each node has at least, say, 1000 males and 1000 females).

If I were to issue a query “give me all females”, the first node (the coordinator) will have to go out and ask all other nodes. Since every node contains 50% females, all nodes will give meaningful responses to the coordinator. The signal to noise ratio is high. Contrast this with the low signal to noise ratio scenario described below.

A counter example — high cardinality scenario

Now suppose we build a secondary index for street_address of Person in a 10 node cluster using RandomPartitioner.

Now if I issue a query “give me 3 people who live in 35 main st.” (could be a family), with roughly 10% chance I contact the node that maintains the local index for “35 main st.”; if it has 5 rows for “35 main st.”, the coordinator can answer the query and be done with it.

In the other 90% of the time, though, the coordinator does not maintain the index for “35 main st.”, so it has to go out and ask all nodes the question. Since only roughly 10% of the nodes have the answer, most nodes will give a meaningless response of “nope, I don’t have it”. The signal to noise ratio is very low, and the overhead of all that communication is high and wastes bandwidth.

Even if node A contains all the people who live in “35 main st.” (suppose there are 5), if I issue a query “give me all people who live in 35 main st.”, node A still has to go out and ask all nodes, because it does not know that, globally, only 5 people live at 35 main st. In this case, all other nodes respond with “nope, I don’t have it”, giving a signal to noise ratio of 0%.

So the conclusion is actually what Stu Hood mentioned in the Jira ticket:

Local indexes are better for:

– Low cardinality fields
– Filtering of values in base order

Distributed indexes are better for:

– High cardinality fields
– Querying of values in index order

That’s how I understood it. Hope it helps (or doesn’t hurt, at least).

Published by kizzx2 on August 18th, 2013 at 12:47 pm.

Slow batch insert with MongoDB sharding and how I debugged it

Now this title sounds fairly technical and seems to belong in a bug ticket rather than a general blog post. But I want to write about it anyway, because it took me a couple of hours to figure out and it highlights how immature MongoDB in general is. At the end I give a solution, so you can still get good performance with sharding + batch insert.

So here we go:

A couple of weeks ago I was hanging out in the #mongodb IRC channel, troubleshooting a performance issue with a guy. He had a beefy 32 GB server with 8 cores, but it was taking 20 seconds to insert 20,000 documents as simple as this:

{ t: "2013-06-23", n: 7, a: 1, v: 0, j: "1234" }

So I wrote a quick script (included below) to try it on my MacBook Pro with SSD, and I was able to get results like this:

20000 documents inserted
Took 580ms
34482.75862068966 doc/s

So something must be wrong with his configuration / code, I thought, and I kept telling him to just run my code on his machine.

Turns out it’s due to MongoDB sharding failing with batch inserts

So it turns out performance also dropped drastically for me after I enabled sharding:

20000 documents inserted
Took 15701ms
1273.8042162919558 doc/s

The test set up

Here’s what I used to test:

  • 3 config servers
  • 8 shards, sharding on { _id: "hashed" }
  • All on localhost

Here is the setup script I used to create 8 shards on my localhost. (By the way, setting up sharding is painful)
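Roughly, such a setup script looks like this — the ports, flags and mongo-shell calls here are my own illustrative reconstruction, not the exact script:

```
# Illustrative sketch only -- ports and exact flags may differ from the real script.

# 1. Three config servers
for i in 1 2 3; do
  mkdir -p mongolab/cfg$i
  mongod --configsvr --port $((29000 + i)) --dbpath mongolab/cfg$i \
         --fork --logpath mongolab/cfg$i/mongo.log
done

# 2. One mongos router in front of them
mongos --configdb localhost:29001,localhost:29002,localhost:29003 \
       --fork --logpath mongolab/mongos.log

# 3. Eight shard mongods, each registered with the cluster
for i in 1 2 3 4 5 6 7 8; do
  mkdir -p mongolab/sh$i
  mongod --shardsvr --port $((28000 + i)) --dbpath mongolab/sh$i \
         --fork --logpath mongolab/sh$i/mongo.log
  mongo --eval "sh.addShard('localhost:$((28000 + i))')"
done

# 4. Shard the test collection on the hashed _id
mongo shard_test --eval 'sh.enableSharding("shard_test");
                         sh.shardCollection("shard_test.foos", { _id: "hashed" })'
```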

The test script

The test script is as simple as possible — just a normal batch insert:

// Quite a lot of orchestration
var count0 = db.foos.find().count();
var t0 = new Date();

var docs = [];
for (var i = 0; i < 20000; i++) {
  docs.push({ t: "2013-06-23", n: 7 * i, a: 1, v: 0, j: "1234" });
}

// And actually just this line is the real action
db.foos.insert(docs);

var t1 = new Date();
var count1 = db.foos.find().count();
var took = t1 - t0;
var count = count1 - count0;
var throughput = count / took * 1000;
print(count + " documents inserted");
print("Took " + took + "ms");
print(throughput + " doc/s");

How I systematically discovered the problem

By passing the -v option to mongod and doing something like tail -f mongolab/**/*.log, I saw tons of logs like this:

==> mongolab/sh5/mongo.log <==
Sun Aug 18 10:12:23.136 [conn2] run command admin.$cmd { getLastError: 1 }
Sun Aug 18 10:12:23.136 [conn2] command admin.$cmd command: { getLastError: 1 } ntoreturn:1 keyUpdates:0  reslen:67 0ms

==> mongolab/sh6/mongo.log <==
Sun Aug 18 10:12:23.136 [conn2] run command admin.$cmd { getLastError: 1 }
Sun Aug 18 10:12:23.136 [conn2] command admin.$cmd command: { getLastError: 1 } ntoreturn:1 keyUpdates:0 reslen:67 0ms

==> mongolab/sh7/mongo.log <==
Sun Aug 18 10:12:23.137 [conn2] run command admin.$cmd { getLastError: 1 }
Sun Aug 18 10:12:23.137 [conn2] command admin.$cmd command: { getLastError: 1 } ntoreturn:1 keyUpdates:0 reslen:67 0ms


So mongos is splitting up the batch insert into individual inserts and doing them one by one, with a getLastError() accompanying each of them!

How to prove that it’s related to batch insert

I changed my test script to do sequential inserts instead, and it worked out fine (note that this is still slower than non-sharded batch insert):

20000 documents inserted
Took 1746ms
11454.75372279496 doc/s

The moral of the story: if you shard, benchmark very carefully if you do batch inserts.

So can I shard and still get good batch insert performance?

I figured out a way to still get good batch insert performance by using a numeric (instead of hashed) shard key:

  • sh.shardCollection("shard_test.foos", {rnd: 1})
  • Pre-chunk the collection on rnd
  • On each document, generate the rnd key db.foos.insert({ rnd: _rand(), t: ...
  • Before inserting the documents, sort the document array, so that mongos only sends N batch inserts if you have N shards

(The code to do all this is available in GitHub, to avoid flooding this post with code snippets)
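The GitHub code aside, the core of the trick (generating and sorting on rnd) fits in a few lines — withRnd and prepareBatch are names made up here for illustration:

```javascript
// Sketch of the rnd-shard-key batch insert trick.
// Assumes the collection is sharded on { rnd: 1 } and pre-chunked on rnd.

// Attach a random shard key to every document.
function withRnd(doc) {
  return Object.assign({ rnd: Math.random() }, doc);
}

// Sort by rnd so documents destined for the same chunk are contiguous,
// letting mongos forward at most N batches for N shards instead of
// one insert (plus getLastError) per document.
function prepareBatch(docs) {
  return docs.map(withRnd).sort(function (a, b) { return a.rnd - b.rnd; });
}

var docs = [];
for (var i = 0; i < 20000; i++) {
  docs.push({ t: "2013-06-23", n: 7 * i, a: 1, v: 0, j: "1234" });
}

var batch = prepareBatch(docs);
// In the mongo shell you would now run: db.foos.insert(batch)
```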

So instead of letting mongos calculate the hash keys and sort the inserts before sending them, I have to do all of this myself. This is fairly basic, and I am totally shocked that the problem could be solved just like that.

The last step (sorting) is also required. Apparently mongos is not smart enough to sort the batch insert to optimize its own operation.

Benchmark Summary

Time in ms (lower is better).
                             No-Shard   No-Shard (with rnd)   Shard { _id: "hashed" }   Shard { rnd: 1 }
Batch insert                      640                   740                     21038               1004
Normal (sequential) insert       1404                  1468                      1573               1790

Note that even with rnd, batch insert is still slower than the non-sharded version. Granted, I ran all the shards on a single machine, but this roughly shows the general non-zero overhead of sharding.

Published by kizzx2 on August 18th, 2013 at 11:49 am.

Thunderbird hanging with 100% CPU usage — culprit discovered — the Google Contacts add-on

Several months ago I started using the Google Contacts add-on for Thunderbird, and it has been hanging periodically ever since. I just assumed it was Thunderbird and tried all the usual compact-the-folders, reset-the-cache tricks.

Today it hung again and I was frustrated, so I ran dtruss (OS X’s strace) on it and found that it seemed to be stuck in an infinite loop trying to access my Google contacts feed.

I replaced it with the gContactSync add-on and it’s working smoothly now.

Published by kizzx2 on May 6th, 2013 at 9:35 pm.

Windows takes 10+ minutes to boot with black screen — Solved!

Just a quick note for anyone struggling with this. This is about the infamous case where, after the Windows logo, there is a black screen for 10+ minutes until Windows shows the welcome screen. The Caps Lock key is unresponsive during that time.

After struggling on and off for a couple of weeks, the culprit was discovered to be a freaking font file!

Failing with Windows EFS

It turned out that when I installed .otf files from my Desktop, which is EFS-encrypted, the font files copied to C:\Windows\Fonts stayed encrypted as well. Because the Windows system account doesn’t have the EFS key, it would stall at the black screen, trying and failing to decrypt the files.

How I solved it

Just decrypt the files before installing them, duh!
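For example, with the built-in cipher tool, from the folder containing the fonts (/d decrypts, /a makes it operate on files rather than directories; *.otf is a placeholder pattern):

```
cipher /d /a *.otf
```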

Published by kizzx2 on May 6th, 2013 at 9:35 pm.