« Back to blog

Cassandra Hinted Handoff

One of the ways that Cassandra ensures high availability, fault tolerance and graceful degradation is through the process of Hinted Handoff. When Cassandra receives a write operation that is designated to be stored in a node that has failed, Cassandra will automatically route the write request to a node that is alive. The node that receives the write request is instructed to save the write operation with a hint. The hint is a message that contains the write request and information about the failed node that should have handled the write request. At this point, the node holds the hint, will monitor the node ring for the recovery of the failed node that missed the write request. Once the failed node is back online, the node that holds the hint will handoff the hint message to the recovered node so that the write request can be persisted in its proper location. The way the node detects the recovery of the failed node is through the Gossip feature of Cassandra.

 

In this kind of scenario performance and consistency depend on the specified consistency level defined in the write operation. For example, if the consistency level is set to ANY, then the write operation will respond to the requester immediately after the write has been written to the node that handled the write requests commit log. This would obviously result in fast performance but low consistency since follow up reads would be out of sync in the node ring. However, if the consistency level is set to anything else like ONE, QUORUM, ALL, then the write would take much longer to complete affecting the performance but ensuring that follow up reads are consistent.

Many in the Cassandra community have expressed concerns about Hinted Handoff for the following reason. When a failed node is down for a long period of time and many write operation occur its possible to have a large amount of hints accumulate. Once the failed node recovers and comes back online, all the nodes holding the write hints will detect the recovery and flood the recovered node with requests. This combined with new requests could cause the recovered node to become unavailable.

As a result it is now possible to disable Hinted Handoffs completely or to simply reduce the priority of the hinted handoff messages relative to new incoming requests.