Running NEO-CLI as a daemon
Surprisingly I couldn’t find an easy way to do it, so I made a Docker image to do that in one line:
docker run --name neo-cli -d --rm -it -p 10332:10332 -v $PWD/Chain:/neo-cli/Chain kizzx2/neo-cli
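To check that the container is up and syncing, you can tail its output with a standard Docker command (neo-cli here is just the container name given above):

docker logs -f neo-cli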
When pip installing certain Python packages with native C extensions on macOS, you may get “undefined symbols” linker errors like this:
"PyUnicode_CompareWithASCIIString", referenced from: _extract_opcodes in _levenshtein.o _extract_editops in _levenshtein.o "_PyUnicode_FromUnicode", referenced from: _median_improve_py in _levenshtein.o _apply_edit_py in _levenshtein.o _median_common in _levenshtein.o "_PyUnicode_InternFromString", referenced from: _PyInit_levenshtein in levenshtein.o "_PyUnicode_Type", referenced from: _hamming_py in _levenshtein.o _jaro_py in _levenshtein.o _jaro_winkler_py in _levenshtein.o _median_improve_py in _levenshtein.o _editops_py in _levenshtein.o _opcodes_py in _levenshtein.o _apply_edit_py in _levenshtein.o ... "_Py_NoneStruct", referenced from: _median_improve_py in _levenshtein.o _median_common in _levenshtein.o
And here’s how to solve it:
LDFLAGS='-undefined dynamic_lookup' pip install cx_Oracle # Or your awesome package
If you just do luarocks install alien you may get this error:
checking for ffi_closure_alloc in -lffi... no
configure: error: in `/tmp/luarocks_alien-0.7.0-1-1080/alien-0.7.0':
configure: error: cannot find new enough libffi
The solution is to install libffi with Homebrew and specify its lib directory with CFLAGS:

brew install libffi
CFLAGS='-L/usr/local/opt/libffi/lib' luarocks install alien
Some older modules still use the base64 package. Solution:
heroku config:set NODE_PATH=node_modules/base64/build/Release/
If grunt nodewebkit trips over the open file descriptor limit, raise it before building:

ulimit -n 9999 && grunt nodewebkit
In the old days, you would run a single test with:
$ ruby -I test test/functional/my_controller_test.rb -n test_my_case
This is all well and good, but it is painfully slow. Then brilliant people created Spork, Zeus and Spring to speed up that process.
With version 4.1, Rails adopted Spring as the official preloader, probably due to its pure-Ruby nature. Unfortunately, it is the only one of the pack that does not seem to support running a single test this way.
After poking around, I discovered a method that works with any preloader: put the following script at script/test.rb:
# script/test.rb
require 'optparse'

# Make test/ loadable, like `ruby -I test` would
$: << Rails.root.join('test')

options = {}
OptionParser.new do |opts|
  opts.on('-n TEST_NAME') do |n|
    options[:test_name] = n
  end

  opts.on('-e ENVIRONMENT') do |e|
    raise ArgumentError.new("Must run in test environment") if e != 'test'
  end
end.parse!

# Whatever is left on ARGV are the test files (globs allowed)
test_files = ARGV.dup
ARGV.clear

# Pass -n through to the test runner so it runs only the named test
if options[:test_name]
  ARGV << "-n" << options[:test_name]
end

test_files.each do |f|
  Dir[f].each do |f1|
    load f1
  end
end
Usage is just like good old ruby -I test:

$ bin/rails r -e test script/test.rb test/controllers/my_controller_test.rb -n test_my_case
$ bin/rails r -e test script/test.rb test/controllers/**/*.rb
If that’s too long, you can just put something like alias rt='bin/rails r -e test script/test.rb' in your shell’s rc file.
With Gradle, the Java world has finally started to catch up with modern dependency management methodologies. Maven, the technology, has always worked, but frankly everybody who used it suffered.
The new Gradle support in Android’s build system is promising but unfortunately still has a lot of rough edges. I had a lot of trouble getting Android Studio to work smoothly, let alone converting existing Ant-based projects.
However, you don’t need to convert your project if all you want is to download .jar files using Gradle. Just create a build.gradle alongside your build.xml:
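A minimal version looks roughly like this (the dependency coordinates are placeholders for whatever libraries your project actually needs):

configurations {
    jars
}

repositories {
    mavenCentral()
}

dependencies {
    // Placeholder coordinates -- swap in the libraries you actually want
    jars 'com.google.code.gson:gson:2.2.4'
    jars 'joda-time:joda-time:2.3'
}

// Copy every resolved .jar (including transitive dependencies) into libs/
// so the existing Ant/Eclipse build picks them up unchanged
task libs(type: Copy) {
    from configurations.jars
    into 'libs'
}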
Now you can invoke a command like gradle libs to get the .jar files copied into the libs directory. The rest of your Ant/Eclipse workflow would then just work.
At the time of writing (as Cassandra is evolving very fast), Cassandra’s documentation recommends using its built-in secondary index only for low cardinality attributes (i.e. attributes with a few unique values).
The reason isn’t immediately obvious and the documentation doesn’t explain it in detail. A quick Google search currently only yields this Jira ticket, which does in fact answer the question, but rather subtly.
This is an attempt to clarify it from my understanding.
The main difference between the primary index and secondary indexes is that of distributed vs. local indexes, as mentioned in the above Jira ticket. Basically, that means every node in the Cassandra cluster can immediately answer the question “Which node contains the row with primary key d397bb236b2d6c3b6bc6fe36893ec1ea?”
Secondary indexes, however, are stored locally (they are implemented as column families), so it is not guaranteed that an arbitrary node can immediately answer the question “Which node contains the Person with state = 'us'?” To answer that question, the node needs to go out and ask every node in the cluster.
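As a concrete illustration in CQL (the person table and its columns here are hypothetical):

-- Primary key lookup: the partitioner lets any node route this straight to the owning node
SELECT * FROM person WHERE id = 'd397bb236b2d6c3b6bc6fe36893ec1ea';

-- Built-in secondary index on state, stored locally on each node
CREATE INDEX ON person (state);

-- Secondary index lookup: the coordinator has to fan this out to the other nodes
SELECT * FROM person WHERE state = 'us';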
Suppose we build a secondary index for gender of Person in a 10-node cluster. If you use RandomPartitioner as recommended, the data is distributed uniformly with respect to gender across all nodes. That is, in normal cases every node should contain roughly 50% males and 50% females.
Now if I issue the query “give me 100 males”, no matter which node I connect to, that first node will be able to answer the query without consulting other nodes (assuming each node has at least, say, 1000 males and 1000 females).
If I were to issue a query “give me all females”, the first node (coordinator) will have to go out and ask all other nodes. Since all nodes contain 50% females, all nodes will give meaningful responses to the coordinator. Signal to noise ratio is high. Contrast this with the low signal to noise ratio scenario described below.
Now suppose we build a secondary index for street_address of Person in a 10-node cluster using RandomPartitioner.
Now if I issue a query “give me 3 people who live in 35 main st.” (which could be a family), then with roughly 10% probability I happen to contact the node that maintains the local index for “35 main st.”; if it has 5 rows for “35 main st.”, the coordinator can answer the query and be done with it.
In the other 90% of cases, though, the coordinator does not maintain the index for “35 main st.”, so it has to go out and ask all nodes the question. Since only roughly 10% of the nodes have the answer, most nodes will give a meaningless response of “nope, I don’t have it”. The signal to noise ratio is very low, and the overhead of such communication is high and wastes bandwidth.
Even if node A contains all the people who live in “35 main st.” (which we suppose is 5 people), if I issue a query “give me all people who live in 35 main st.”, node A still has to go out and ask all nodes, because it does not know that, globally, only 5 people live in 35 main st. In this case, all the other nodes respond with “nope, I don’t have it”, giving a signal to noise ratio of 0%.
So the conclusion is actually what Stu Hood mentioned in the Jira ticket:
Local indexes are better for:
– Low cardinality fields
– Filtering of values in base order

Distributed indexes are better for:
– High cardinality fields
– Querying of values in index order
That’s how I understood it. Hope it helps (or doesn’t hurt, at least).
Now this title sounds fairly technical and seems to belong in a bug ticket rather than a general blog post, but I want to write about it anyway because it took me a couple of hours to figure out, and it highlights how immature MongoDB still is in general. At the end I give a solution so you can still get good performance with sharding + batch inserts.
So here we go:
A couple of weeks ago I was hanging out in the #mongodb IRC channel, troubleshooting a performance issue with a guy. He had a beefy 32 GB server with 8 cores, but it was taking 20 seconds for him to insert 20000 documents as simple as this:
{ t: "2013-06-23", n: 7, a: 1, v: 0, j: "1234" }
So I wrote a quick script (included below) to try it on my MacBook Pro with SSD, and I was able to get results like this:
20000 documents inserted
Took 580ms
34482.75862068966 doc/s
So something must be wrong with his configuration / code, I thought, and I kept telling him to just run my code on his machine.
It turned out that performance dropped drastically for me as well after I enabled sharding:
20000 documents inserted
Took 15701ms
1273.8042162919558 doc/s
The test setup
Here’s the shard key I used for the test:

{ _id: "hashed" }
Here is the setup script I used to create 8 shards on my localhost. (By the way, setting up sharding is painful)
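Roughly, it boils down to starting a config server, the shard mongods, and a mongos, then registering the shards. The ports and dbpaths below are placeholders; repeat the shard steps for all 8 shards:

# Data directories
mkdir -p mongolab/config mongolab/sh0

# Config server
mongod --configsvr --port 29000 --dbpath mongolab/config --fork --logpath mongolab/config/mongo.log

# One mongod per shard (repeat with different ports and dbpaths for sh1..sh7)
mongod --shardsvr --port 27018 --dbpath mongolab/sh0 --fork --logpath mongolab/sh0/mongo.log

# The mongos router the test script connects to
mongos --configdb localhost:29000 --port 27017 --fork --logpath mongolab/mongos.log

Then, from a mongo shell connected to the mongos:

sh.addShard("localhost:27018")  // repeat for each shard
sh.enableSharding("shard_test")
sh.shardCollection("shard_test.foos", { _id: "hashed" })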
The test script is as simple as possible — just a normal batch insert:
// Quite a lot of orchestration
var count0 = db.foos.find().count();
var t0 = Date.now();

var docs = [];
for (var i = 0; i < 20000; i++) {
  docs.push({ t: "2013-06-23", n: 7 * i, a: 1, v: 0, j: "1234" });
}

// And actually just these couple of lines are the real action
db.foos.insert(docs);
db.getLastError();

var t1 = Date.now();
var count1 = db.foos.find().count();

var took = t1 - t0;
var count = count1 - count0;
var throughput = count / took * 1000;

print(count + " documents inserted");
print("Took " + took + "ms");
print(throughput + " doc/s");
How I systematically discovered the problem
By passing the -v option to mongod and doing something like tail -f mongolab/**/*.log, I saw tons of logs like this:
==> mongolab/sh5/mongo.log <==
Sun Aug 18 10:12:23.136 [conn2] run command admin.$cmd { getLastError: 1 }
Sun Aug 18 10:12:23.136 [conn2] command admin.$cmd command: { getLastError: 1 } ntoreturn:1 keyUpdates:0 reslen:67 0ms

==> mongolab/sh6/mongo.log <==
Sun Aug 18 10:12:23.136 [conn2] run command admin.$cmd { getLastError: 1 }
Sun Aug 18 10:12:23.136 [conn2] command admin.$cmd command: { getLastError: 1 } ntoreturn:1 keyUpdates:0 reslen:67 0ms

==> mongolab/sh7/mongo.log <==
Sun Aug 18 10:12:23.137 [conn2] run command admin.$cmd { getLastError: 1 }
Sun Aug 18 10:12:23.137 [conn2] command admin.$cmd command: { getLastError: 1 } ntoreturn:1 keyUpdates:0 reslen:67 0ms

...
So mongos is splitting up the batch insert into individual inserts and doing them one by one, with a getLastError() accompanying each of them!
I changed my test script to do sequential inserts instead, and it worked out fine (note that this is still slower than a non-sharded batch insert):
20000 documents inserted
Took 1746ms
11454.75372279496 doc/s
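For reference, the sequential variant just replaces the single batch insert in the script above with a plain loop, roughly like this:

// Sequential variant: one insert call per document instead of one big batch
var t0 = Date.now();
for (var i = 0; i < 20000; i++) {
  db.foos.insert({ t: "2013-06-23", n: 7 * i, a: 1, v: 0, j: "1234" });
}
db.getLastError();
print("Took " + (Date.now() - t0) + "ms");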
The moral of the story is that if you shard, you should benchmark very carefully if you do batch inserts.
I figured out a way to still get good batch insert performance by using a numeric (instead of hashed) shard key:
sh.shardCollection("shard_test.foos", {rnd: 1})

Each document then carries a pre-computed random rnd value:

db.foos.insert({ rnd: _rand(), t: ...
(The code to do all this is available on GitHub, to avoid flooding this post with code snippets.)
So instead of letting mongos calculate and sort the hash keys before sending the inserts, I have to do all of this myself. This is fairly basic, and I am totally shocked that it could have been solved just like that. The last step (sorting) is also required; apparently mongos is not smart enough to sort the batch insert to optimize its own operation.
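Put together, the batch insert side of the workaround looks roughly like this sketch (using Math.random() for the random key; the actual code on GitHub may differ in details):

// Pre-compute a random shard key for each document ourselves
var docs = [];
for (var i = 0; i < 20000; i++) {
  docs.push({ rnd: Math.random(), t: "2013-06-23", n: 7 * i, a: 1, v: 0, j: "1234" });
}

// Sort by the shard key before sending, so mongos receives the batch
// in shard key order and can forward contiguous chunks to each shard
docs.sort(function (a, b) { return a.rnd - b.rnd; });

db.foos.insert(docs);
db.getLastError();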
Time in ms (lower is better):

                               No-Shard    No-Shard (with rnd)    Shard { _id: "hashed" }    Shard { rnd: 1 }
Batch insert                      640              740                    21038                   1004
Normal (sequential) insert       1404             1468                     1573                   1790
Note that even with the rnd key, inserts are still slower than the non-sharded version. Granted, I ran all the shards on a single machine, but this shows the general non-zero overhead of sharding.