Apache Cassandra Blog

Apache Cassandra Architecture – A Bias For Action


An Introduction To Apache Cassandra , launched us to numerous forms of NoSQL database and Apache Cassandra. On this article I am going to delve into Cassandra’s Architecture.

Cassandra is a peer-to-peer distributed database that runs on a cluster of homogeneous nodes. Cassandra has been architected from the ground up to handle giant volumes of knowledge whereas providing high availability.  Cassandra supplies excessive write and skim throughput.  A Cassandra cluster has no particular nodes i.e. the cluster has no masters, no slaves or elected leaders. This permits Cassandra to be highly out there whereas having no single level of failure.

Key Ideas, Knowledge Buildings and Algorithms

To be able to perceive Cassandra’s architecture it is very important understand some key ideas, knowledge buildings and algorithms incessantly used by Cassandra.

  • Knowledge Partitioning –  Apache Cassandra is a distributed database system using a shared nothing structure. A single logical database is spread throughout a cluster of nodes and thus the need to unfold knowledge evenly amongst all collaborating nodes. At a 10000 foot degree Cassandra shops knowledge by dividing knowledge evenly around its cluster of nodes. Each node is answerable for part of the info. The act of distributing knowledge across nodes is referred to as knowledge partitioning.
  • Constant Hashing – Two important issues crop up when making an attempt to distribute knowledge efficiently. One, figuring out a node on which a selected piece of knowledge ought to reside on. Two, minimising knowledge motion when including or removing nodes. Constant hashing allows us to realize these objectives. A consistent hashing algorithm allows us to map Cassandra row keys to physical nodes. The vary of values from a consistent hashing algorithm is a hard and fast round area which may be visualised as a ring. Constant hashing additionally minimises the important thing movements when nodes be a part of or depart the cluster. On average solely okay/n keys must be remapped the place okay is the number of keys and n is the variety of slots (nodes). That is in stark contrast to most hashing algorithms where a change in the number of slots leads to the necessity to remap a giant number of keys.
  • Knowledge Replication – Partitioning of knowledge on a shared nothing system leads to a single level of failure i.e. if one of many nodes goes down a part of your knowledge is unavailable. This limitation is overcome by creating copies of the info, know as replicas, thus avoiding a single point of failure. Storing copies of knowledge on multiple nodes is known as replication.  Replication of knowledge ensures fault tolerance and reliability.
  • Eventual Consistency – Since knowledge is replicated across nodes we need to be sure that knowledge is synchronized across replicas. This is known as knowledge consistency.  Eventual consistency is a consistency mannequin utilized in distributed computing. It theoretically ensures that, offered there are not any new updates, all nodes/replicas will ultimately return the final updated value. Area Identify System (DNS) are a superb instance of an ultimately consistent system.
  • Tunable Consistency – Cassandra offers tunable consistency i.e. users can determine the consistency degree by tuning it by way of read and write operations. Eventual consistency typically conjures up worry and doubt in the minds of software builders. The important thing factor to remember is that reaching a constant state typically takes microseconds.
  • Consistency Degree – Cassandra allows users to configure the number of replicas in a cluster that must acknowledge a learn or write operation before considering the operation successful. The consistency degree is a required parameter in any learn and write operation and determines the exact variety of nodes that should successfully full the operation earlier than contemplating the operation successful.
  • Knowledge Centre, Racks, Nodes – A Knowledge Centre (DC) is a centralised place to deal with pc and networking techniques to assist meet an organisation’s info know-how needs. A rack is a unit that incorporates a number of servers all stacked one on prime of one other. A rack allows knowledge centres to preserve flooring area and consolidates networked assets. A node is a single server in a rack. Why can we care? Typically Cassandra is deployed in a DC setting and one must replicate knowledge intelligently to make sure no single level of failure. Knowledge have to be replicated to servers in several racks to make sure continued availability in the case of rack failure. Cassandra could be simply configured to work in a multi DC setting to facilitate fail over and catastrophe restoration.
  • Snitches and Replication Methods – As mentioned above it is very important intelligently distribute knowledge throughout DC’s and racks. In Cassandra the distribution of knowledge across nodes is configurable. Cassandra makes use of snitches and replication methods to find out how knowledge is replicated across DC’s, racks and nodes. Snitches decide proximity of nodes inside a ring. Replication methods use proximity info offered by snitches to find out locality of a specific copy.
  • Gossip Protocol – Cassandra makes use of a gossip protocol to discover node state for all nodes in a cluster.  Nodes uncover details about other nodes by exchanging state information about themselves and other nodes they find out about. That is finished with a maximum of three different nodes. Nodes don’t trade info with every different node within the cluster in an effort to scale back community load. They only trade info with a couple of nodes and over a time period state information about every node propagates throughout the cluster. The gossip protocol facilitates failure detection.
  • Bloom Filters –  A bloom filter is a particularly quick option to check the existence of a knowledge construction in a set. A bloom filter can inform if an item may exist in a set or undoubtedly doesn’t exist within the set. False positives are potential however false negatives are usually not. Bloom filters are a good way of avoiding expensive I/O operation.
  • Merkle Tree – Merkle tree is a hash tree which offers an environment friendly solution to find differences in knowledge blocks. Leaves include hashes of particular person knowledge blocks and mum or dad nodes include hashes of their respective youngsters. This permits environment friendly means of discovering differences between nodes.
  • SSTable – A Sorted String Table (SSTable) ordered immutable key worth map. It is principally an efficient means of storing giant sorted knowledge segments in a file.
  • Write Back Cache – A write back cache is the place the write operation is simply directed to the cache and completion is instantly confirmed. That is totally different from Write-through cache where the write operation is directed on the cache however is simply confirmed as soon as the info is written to each the cache and the underlying storage structure.
  • Memtable – A memtable is a write again cache residing in reminiscence which has not been flushed to disk but.
  • Cassandra Keyspace – Keyspace is just like a schema in the RDBMS world. A keyspace is a container for all of your software knowledge. When defining a keyspace, you have to specify a replication technique and a replication issue i.e. the number of nodes that the info have to be replicate too.
  • Column Family – A column household is analogous to the idea of a table in an RDBMS. But that is where the similarity ends. As an alternative of considering of a column household as RDBMS desk think of a column household as a map of sorted map. A row in the map offers entry to a set of columns which is represented by a sorted map.Map<RowKey, SortedMap>

    Please observe in CQL (Cassandra Question Language) lingo a Column Household is known as a table.

  • Row Key – A row secret is also called the partition key and has numerous columns associated with it i.e. a sorted map as shown above. The row secret is liable for figuring out knowledge distribution throughout a cluster.

Cassandra Cluster/Ring

Cluster Bootstrapping

Each Cassandra cluster have to be assigned a identify. All nodes collaborating in a cluster have the same identify. Seed nodes are used during start up to assist uncover all collaborating nodes. Seeds nodes haven’t any particular function aside from serving to bootstrap the cluster utilizing the gossip protocol. When a node begins up it seems to its seed listing to acquire information about the other nodes within the cluster. Cassandra makes use of the gossip protocol for intra cluster communication and failure detection. A node exchanges state info with a maximum of three different nodes. State info is exchanged every second and accommodates details about itself and all different recognized nodes.  This permits every node to study each other node in the cluster regardless that it’s communicating with a small subset of nodes.

Cassandra Ring

Instance Cassandra ring distributing 255 tokens evenly throughout four nodes.

A Cassandra cluster is visualised as a ring because it uses a consistent hashing algorithm to distribute knowledge. At start up every node is assigned a token range which determines its place in the cluster and the fashion of knowledge stored by the node. Every node receives a proportionate vary of the token ranges to make sure that knowledge is unfold evenly throughout the ring. The figure above illustrates dividing a 0 to 255 token range evenly amongst a four node cluster. Each node is assigned a token and is answerable for token values from the earlier token (unique) to the node’s token (inclusive). Each node in a Cassandra cluster is liable for a certain set of knowledge which is decided by the partitioner. A partitioner is a hash perform for computing the resultant token for a specific row key. This token is then used to find out the node which can retailer the first duplicate.  At present Cassandra provides a Murmur3Partitioner (default), RandomPartitioner and a ByteOrderedPartitioner.


Cassandra also replicates knowledge based on the chosen replication strategy. The replication strategy determines placement of the replicated knowledge.  There are two major replication methods used by Cassandra, Easy Technique and the Community Topology Technique. The first duplicate for the info is decided by the partitioner. The location of the next replicas is decided by the replication technique. The straightforward strategy places the next replicas on the subsequent node in a clockwise method. The network topology strategy works properly when Cassandra is deployed across knowledge centres. The  community topology strategy is knowledge centre conscious and makes positive that replicas usually are not stored on the identical rack.

Cassandra makes use of snitches to discover the overall community general topology.  This info is used to efficiently route inter-node requests inside the bounds of the duplicate placement strategy.

Cassandra Write Path

Lets attempt to perceive Cassandra’s structure by strolling by means of an instance write mutation.

Let’s assume that a shopper wishes to put in writing a bit of knowledge to the database. The diagram under illustrates the cluster degree interaction that takes place.

Cassandra Write Path Ring

Cluster degree interaction for a write and skim operation

Since Cassandra is masterless a shopper can join with any node in a cluster. Shoppers can interface with a Cassandra node using both a thrift protocol or utilizing CQL. Within the image above the shopper has related to Node four. The node that a shopper connects to is designated because the coordinator, additionally illustrated in the diagram. The coordinators is answerable for satisfying the shoppers request. The consistency degree determines the variety of nodes that the coordinator wants to hear from so as to notify the shopper of a profitable mutation.  All inter-node requests are sent via a messaging service and in an asynchronous manner.

Based mostly on the partition key and the replication technique used the coordinator forwards the mutation to all relevant nodes. In our instance it’s assumed that nodes 1,2 and 3 are the relevant nodes where node 1 is the first duplicate and nodes two and three are subsequent replicas. The coordinator will watch for a response from the suitable number of nodes required to satisfy the consistency degree.  QUORUM is a commonly used consistency degree which refers to a majority of the nodes.QUORUM may be calculated using the method (n/2 +1) the place n is the replication factor. In our example let’s assume that we have now a consistency degree of QUORUM and a replication factor of three. Thus the coordinator will look forward to at most 10 seconds (default setting) to hear from a minimum of two nodes before informing the shopper of a successful mutation.

Cassandra Write Path

Write operations at a node degree

Every node processes the request individually. Every node first writes the mutation to the commit log and then writes the mutation to the memtable. Writing to the commit log ensures durability of the write as the memtable is an in-memory structure and is just written to disk when the memtable is flushed to disk. A memtable is flushed to disk when:

  1. It reaches its maximum allotted measurement in memory
  2. The variety of minutes a memtable can stay in reminiscence elapses.
  3. Manually flushed by a consumer

A memtable is flushed to an immutable construction referred to as and SSTable (Sorted String Table). The commit log is used for playback purposes in case knowledge from the memtable is lost resulting from node failure. For example the machine has a power outage before the memtable might get flushed. Every SSTable creates three information on disk which embrace a bloom filter, a key index and a knowledge file.

Over a time period quite a few SSTables are created. This leads to the necessity to learn a number of SSTables to fulfill a read request. Compaction is the method of combining SSTables so that associated knowledge could be found in a single SSTable. This helps with making reads a lot quicker.

Cassandra Read Path

On the cluster degree a read operation is just like a write operation. As with the write path the shopper can connect with any node in the cluster. The chosen node known as the coordinator and is chargeable for returning the requested knowledge.  A row key have to be provided for each learn operation. The coordinator uses the row key to find out the first duplicate. The replication technique together with the replication factor is used to find out all different relevant replicas. As with the write path the consistency degree determines the variety of duplicate’s that should reply earlier than efficiently returning knowledge. Let’s assume that the request has a consistency degree of QUORUM and a replication factor of three, thus requiring the coordinator to await profitable replies from at the very least two nodes. If the contacted replicas has a totally different model of the info the coordinator returns the newest version to the shopper and issues a read restore command to the node/nodes with the older model of the info. The learn restore operation pushes the newer version of the info to nodes with the older version.

Cassandra Read Path Overview

Node degree read operation

The illustration above outlines key steps when studying knowledge on a specific node. Each Column Household shops knowledge in quite a few SSTables. Thus Knowledge for a specific row might be situated in quite a few SSTables and the memtable. Thus for each read request Cassandra needs to read knowledge from all applicable SSTables ( all SSTables for a column family) and scan the memtable for applicable knowledge fragments. This knowledge is then merged and returned to the coordinator.

Cassandra Read Path

SSTable read path

On a per SSTable basis the operation turns into a bit extra difficult. The illustration above outlines key steps that happen when studying knowledge from an SSTable. Every SSTable has an associated bloom filter which allows it to shortly confirm if knowledge for the requested row key exists on the corresponding SSTable. This reduces IO when performing an row key lookup. A bloom filter is all the time held in memory because the entire objective is to save lots of disk IO. Cassandra additionally retains a replica of the bloom filter on disk which allows it to recreate the bloom filter in reminiscence shortly .  Cassandra doesn’t store the bloom filter Java Heap as an alternative makes a separate allocation for it in reminiscence.  If the bloom filter returns a destructive response no knowledge is returned from the actual SSTable. This is  a standard case because the compaction operation tries to group all row key associated knowledge into as few SSTables as attainable. If the bloom filter supplies a constructive response the partition key cache is scanned to determine the compression offset for the requested row key. It then proceeds to fetch the compressed knowledge on disk and returns the end result set. If the partition cache doesn’t include a corresponding entry the partition key abstract is scanned. The partition summary is a subset to the partition index and helps decide the approximate location of the index entry within the partition index. The partition index is then scanned to find the compression offset which is then used to seek out the suitable knowledge on disk.

For those who reached the top of this long publish then nicely executed. In this publish I have offered an introduction to Cassandra architecture. In my upcoming posts I will attempt to clarify Cassandra architecture using a extra sensible strategy.