Key Value Store or Key Value Database
Keys can be strings, integers, etc., and the value is treated as an opaque object by the key-value store: it can be a string, an integer, a list, an object, and so on. Some examples of key-value stores are Memcached, Redis, etc. Key-value databases are highly partitionable and can be scaled horizontally to a much larger extent than other databases.
- put(key, value) : Add a key with the given value to the key-value store
- get(key) : Get the value associated with a key
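A minimal sketch of this interface in Python, using a plain dictionary as the backing store (the class name and example key are illustrative, not from any particular product):

```python
class KeyValueStore:
    """Minimal in-memory key-value store exposing put/get."""

    def __init__(self):
        self._data = {}  # key -> value; the value is treated as an opaque object

    def put(self, key, value):
        # Insert or overwrite the value associated with key.
        self._data[key] = value

    def get(self, key):
        # Return the value for key, or None if the key is absent.
        return self._data.get(key)


store = KeyValueStore()
store.put("user:42", {"name": "Alice"})
print(store.get("user:42"))  # {'name': 'Alice'}
```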
- High Scalability: It should be able to handle a sudden increase in load without issues by adding more instances.
- Consistency: It should always provide consistent and correct responses.
- Durability: There should be no data loss in case of network issues or partition failures.
- Availability: It should be highly available; per the CAP theorem a trade-off must be made, and this design is a CP system, i.e., consistency takes precedence over availability.
CAP Theorem
CP (Consistency and Partition Tolerance) systems: A CP system delivers consistency and partition tolerance at the expense of availability. When a partition occurs between any two nodes, the system has to shut down the non-consistent node (i.e., make it unavailable) until the partition is resolved.
AP (Availability and Partition Tolerance) systems: An AP system delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available but those at the wrong end of a partition might return an older version of data than others. (When the partition is resolved, the AP systems typically resync the nodes to repair all inconsistencies in the system.)
CA (Consistency and Availability) systems: A CA system delivers consistency and availability across all nodes. It can’t do this if there is a partition between any two nodes in the system, and therefore can’t deliver partition tolerance. Since network failures are unavoidable, a distributed system must tolerate network partitions, so a CA system cannot exist in real-world distributed applications.
Which one to choose: CP or AP?
If consistency is chosen over availability, the system must block all write operations on the remaining active nodes whenever any node goes down, to avoid data inconsistency. The system is therefore unavailable until the impacted node is back up and running. For example, in banking applications high consistency is a major requirement, so that account information is always reported accurately.
If availability is chosen over consistency, the system keeps accepting reads even though it may return stale data. It also keeps accepting writes at the working nodes, and the data is synced to the node that was down once the network partition is resolved.
Simple single-server key-value store system
A key-value store on a single server is easy to build: store the key-value pairs in a hash table, with everything kept in memory. Accessing data from memory is fast, but keeping everything in memory is impossible due to space limitations. Still, a few optimizations can fit more data on a single server: keep only frequently used data in memory and move the rest to disk, and compress values before storing them to save space. Even with these optimizations, the space constraints of a single server make it unsuitable for a highly scalable distributed system.
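A rough sketch of those two optimizations combined, assuming string keys and picklable values; the capacity limit, file name, and LRU eviction policy are illustrative choices, not a prescribed design:

```python
import pickle
import shelve
import zlib
from collections import OrderedDict

class TieredStore:
    """Keep the hottest entries in memory; spill cold, compressed entries to disk."""

    def __init__(self, capacity=1000, disk_path="kv_overflow.db"):
        self._memory = OrderedDict()         # LRU order: oldest entries first
        self._capacity = capacity
        self._disk = shelve.open(disk_path)  # simple on-disk key-value file

    def put(self, key, value):
        self._memory[key] = value
        self._memory.move_to_end(key)        # mark as most recently used
        if len(self._memory) > self._capacity:
            cold_key, cold_value = self._memory.popitem(last=False)
            # Compress the evicted value before writing it to disk.
            self._disk[cold_key] = zlib.compress(pickle.dumps(cold_value))

    def get(self, key):
        if key in self._memory:
            self._memory.move_to_end(key)
            return self._memory[key]
        if key in self._disk:
            # Decompress on a disk hit and promote the entry back into memory.
            value = pickle.loads(zlib.decompress(self._disk[key]))
            del self._disk[key]
            self.put(key, value)
            return value
        return None
```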
Distributed key-value store system
A distributed key-value store (also called a distributed hash table) distributes key-value pairs across many servers.
Various components to build a key-value store:
- Data Replication
- Data Partition
- Consistency
- Failure Handling
Data Replication
To achieve high availability and durability, each key-value pair is replicated to multiple nodes: with consistent hashing (described under Data Partitions below), a key is stored on the node it maps to and then copied to the next N-1 distinct nodes found while walking clockwise around the ring, so the failure of a single node does not lose data.
Data Partitions
In large-scale applications the data cannot fit on a single server, so it is usually split into partitions served by multiple nodes. Two challenges come with data partitioning: minimizing data movement when the number of nodes changes, and distributing data evenly across the nodes. The good part is that consistent hashing solves both problems, and it provides further advantages: automatic scaling (servers can be added and removed based on load) and heterogeneity (the number of virtual nodes allocated to a server is proportional to its capacity).
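A minimal consistent-hash ring sketch with virtual nodes, assuming MD5 as the ring hash and hypothetical node names; it also shows how walking the ring yields the replica nodes mentioned under Data Replication:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Map an arbitrary string to a point on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent-hash ring; bigger servers get more virtual nodes."""

    def __init__(self):
        self._ring = []    # sorted list of (hash, server) points
        self._hashes = []  # parallel list of hashes, kept sorted for bisect

    def add_server(self, server: str, virtual_nodes: int = 100):
        # More virtual nodes -> a larger share of the key space.
        for i in range(virtual_nodes):
            h = _hash(f"{server}#{i}")
            index = bisect.bisect(self._hashes, h)
            self._hashes.insert(index, h)
            self._ring.insert(index, (h, server))

    def remove_server(self, server: str):
        keep = [(h, s) for (h, s) in self._ring if s != server]
        self._ring = keep
        self._hashes = [h for (h, _) in keep]

    def get_servers(self, key: str, n: int = 3):
        # Walk clockwise from the key's position, collecting n distinct servers
        # (the first holds the primary copy, the rest hold replicas).
        if not self._ring:
            return []
        start = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        servers = []
        for i in range(len(self._ring)):
            _, server = self._ring[(start + i) % len(self._ring)]
            if server not in servers:
                servers.append(server)
            if len(servers) == n:
                break
        return servers

ring = ConsistentHashRing()
ring.add_server("node-a", virtual_nodes=100)
ring.add_server("node-b", virtual_nodes=200)  # twice the capacity
print(ring.get_servers("user:42", n=3))
```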
Consistency
Since the data is present across multiple servers/nodes, synchronizing it between all the replicas is important. For example, if machine A1 has a replica A2, how can we be certain that both machines always contain the same data? When adding new data we must update both machines, but what if one of them fails the write operation? The data would then be inconsistent between the two machines. Consistency can be achieved using techniques such as:
- Keeping a local copy at the coordinator (the proxy between the client and the nodes), so that if an update operation fails the coordinator can retry it.
- Maintaining a commit log, as in git: each node records a commit log entry for each transaction, so before updating an entry on machine A we first go through the commit log. A separate process scans the commit logs and recovers any failed operation, since the intended change can be read back from the log.
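A minimal sketch of such a commit log, assuming JSON-serializable operations and an illustrative file name; a production store would also fsync and checksum each entry:

```python
import json
import os

class CommitLog:
    """Append each operation to a log file before applying it to the store."""

    def __init__(self, path="commit.log"):
        self._path = path

    def append(self, op, key, value=None):
        # Record the intent durably before the store itself is touched.
        with open(self._path, "a") as f:
            f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")

    def replay(self, store: dict):
        # On recovery, re-apply every logged operation in order.
        if not os.path.exists(self._path):
            return
        with open(self._path) as f:
            for line in f:
                entry = json.loads(line)
                if entry["op"] == "put":
                    store[entry["key"]] = entry["value"]
                elif entry["op"] == "delete":
                    store.pop(entry["key"], None)

log = CommitLog()
log.append("put", "user:42", {"name": "Alice"})

recovered = {}
log.replay(recovered)  # rebuilds state after a crash
print(recovered)
```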
Failure Handling
The client communicates with the key-value store through the API endpoints get(key) and put(key, value). A coordinator node acts as a proxy between the client and the key-value store. Nodes are distributed in a ring using consistent hashing, and data is replicated at multiple nodes. There is no single point of failure, since every node has the same responsibilities. Tasks performed by each node include failure detection, conflict resolution, replication, etc.
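As one illustration of the failure-detection task, a node can track heartbeats from its peers and suspect any peer whose heartbeat is overdue; this heartbeat scheme and its timeout are assumptions for the sketch, not a detail from the design above:

```python
import time

class FailureDetector:
    """Mark a peer as suspected if no heartbeat arrives within the timeout."""

    def __init__(self, peers, timeout_seconds=5.0):
        now = time.monotonic()
        self._last_seen = {peer: now for peer in peers}
        self._timeout = timeout_seconds

    def record_heartbeat(self, peer):
        # Called whenever a heartbeat message arrives from a peer.
        self._last_seen[peer] = time.monotonic()

    def suspected_failures(self):
        # Peers that have been silent for longer than the timeout.
        now = time.monotonic()
        return [p for p, t in self._last_seen.items() if now - t > self._timeout]

detector = FailureDetector(["node-a", "node-b", "node-c"], timeout_seconds=5.0)
detector.record_heartbeat("node-a")
print(detector.suspected_failures())  # nodes whose heartbeats are overdue
```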
Write and Read Paths
On the write path, a request is first persisted to the commit log and then written to the in-memory cache; when the cache is full, its contents are flushed to an SSTable on disk. On the read path, if the data is not present in memory it is retrieved from disk, and a Bloom filter is used to efficiently determine which SSTables might contain the key.
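A minimal Bloom filter sketch; the bit-array size and number of hash functions are illustrative (real systems derive them from the expected key count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from one digest of the key.
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self._num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self._size

    def add(self, key: str):
        for pos in self._positions(key):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False: the key is definitely not in this SSTable.
        # True: it might be, so the table must actually be searched.
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

sstable_filter = BloomFilter()
sstable_filter.add("user:42")
print(sstable_filter.might_contain("user:42"))   # True
print(sstable_filter.might_contain("user:999"))  # almost surely False
```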