At the time of writing (as Cassandra is evolving very fast), Cassandra’s documentation recommends using its built-in secondary index only for low cardinality attributes (i.e. attributes with a few unique values).
The reason isn’t immediately obvious and the documentation doesn’t explain it in details. A quick Google search only yields this Jira ticket at the moment. Which does in fact answer the question but does it rather subtlely.
This is an attempt to clarify it from my understanding.
Secondary indexes are stored locally
The main difference between the primary index and secondary indexes are distributed indexes vs. local indexes, as mentioned in the above Jira ticket. Basically, that means that every node in the Cassandra cluster can answer the question “Which node contains the row with primary key
However, secondary indexes are stored locally, as they are implemented as column families, it is not guaranteed that an arbitrary node can answer the question “Which node contains the Person with
state = 'us'?” immediately. To answer that question, the node needs to go out and ask all nodes that question.
An example — low cardinality scenario
Suppose we build a secondary index for gender of Person in a 10 node cluster. Suppose you use RandomPartitioner as recommended, the data is distributed uniformly vs. gender for all nodes. That is, in normal cases every node should contain 50% males and 50% females.
Now if I issue a query “give me 100 males”. No matter which node I connect to, the first node will be able to answer the query without consulting other nodes (assuming each node has at least, for example, 1000 males and 1000 females, etc.).
If I were to issue a query “give me all females”, the first node (coordinator) will have to go out and ask all other nodes. Since all nodes contain 50% females, all nodes will give meaningful responses to the coordinator. Signal to noise ratio is high. Contrast this with the low signal to noise ratio scenario described below.
A counter example — high cardinality scenario
Now suppose we build a secondary index for
street_address of Person in a 10 node cluster using
Now if I issue a query “give me 3 people who live in 35 main st.” (Could be a family) With roughly 10% chance I contact the node that maintains the local index of “35 main st.” and it has 5 rows for “35 main st.”, then the coordinator can answer the query and be done with it.
In the other 90% of the time, though, the coordinator does not maintain the index “35 main st.”, so it has to go out and ask all nodes the question. Since only roughly 10% of the nodes has the answer, most nodes will give a meaningless response of “nope, I don’t have it”. The signal to noise ratio is very low and the overhead of such communication is high and wastes bandwidth.
Even if node A contains all people who lives in “35 main st.”, which we suppose is 5. If I were to issue a query “give me all people who live in 35 main st.”, node A is still going to have to go out and ask all nodes, because it does not know that, globally, only 5 people live in 35 main st. In this case, all nodes respond with “nope, I don’t have it” giving a signal to noise ratio of 0%.
So the conclusion is actually what Stu Hood mentioned in the Jira ticket:
Local indexes are better for:
– Low cardinality fields
– Filtering of values in base order
Distributed indexes are better for:
– High cardinality fields
– Querying of values in index order
That’s how I understood it. Hope it helps (or doesn’t hurt, at least).