[Concept] Database sharding and consistent hashing
When a database table grows to many gigabytes of data or billions of records, managing data availability and workload on a single database instance becomes very challenging. In those scenarios, partitioning the table is recommended to make the data easier to manage and process.
One of the most drastic approaches to partitioning is sharding the database, which splits the data and workload across different database instances, allowing the use of multiple servers and increasing the available processing power and storage.
Sharding a database brings many challenges that are often trivial on single-instance databases, such as data aggregation and filtering, instance selection, hotspot detection and shard management.
This article focuses specifically on introducing the concept of distributing data across database shards using consistent hashing.
Hashing
Hashing involves using an algorithm that maps object data to a numerical representation. For database sharding, we have to calculate a hash based on the record ID.
Good hashing algorithms for sharding aim to distribute values evenly, execute quickly, and be built so that a small change in the input drastically changes the output. Generating repeated hash values is not a problem, but a large output range gives more flexibility.
For simplicity's sake, the demonstration below generates values from 0 to 999, but in a real-life scenario a range from 0 to 2^32 would be preferred.
| Input | Output |
| --- | --- |
| "key_1" | 22 |
| "key_2" | 894 |
| "key_3" | 380 |
Dynamic sharding
Each database shard will be responsible for a specific range of the hashing output. The objective is to balance the workload, so at first it is recommended to split the data evenly.
| Database instance | Range |
| --- | --- |
| Instance A | [000..249] |
| Instance B | [250..499] |
| Instance C | [500..749] |
| Instance D | [750..999] |
This way, we can determine where the data should be kept based on the hash value.
| Input | Hash | Instance |
| --- | --- | --- |
| "key_1" | 22 | Instance A |
| "key_2" | 894 | Instance D |
| "key_3" | 380 | Instance B |
Adding an instance
In situations where an instance is getting too big, or has become a hotspot and cannot handle all requests, it's possible to split that instance's hash range and relocate data from that instance only, while all the others continue operating as normal. For example, if we wanted to split instance A in half:
| Database instance | Range | Required data reallocation |
| --- | --- | --- |
| Instance A1 | [000..124] | Yes |
| Instance A2 | [125..249] | Yes |
| Instance B | [250..499] | No |
| Instance C | [500..749] | No |
| Instance D | [750..999] | No |
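One possible sketch of that split, keeping the range table as a plain dictionary (the instance names are taken from the example above):

```python
def split_range(ranges: dict[str, tuple[int, int]], instance: str,
                new_a: str, new_b: str) -> None:
    """Split one instance's [low..high] range in half.

    Only records currently stored on `instance` need to be rehashed and
    relocated; every other instance keeps its range untouched.
    """
    low, high = ranges.pop(instance)
    mid = (low + high) // 2
    ranges[new_a] = (low, mid)
    ranges[new_b] = (mid + 1, high)

ranges = {
    "Instance A": (0, 249),
    "Instance B": (250, 499),
    "Instance C": (500, 749),
    "Instance D": (750, 999),
}
split_range(ranges, "Instance A", "Instance A1", "Instance A2")
# Instance A1 -> (0, 124), Instance A2 -> (125, 249); B, C and D untouched.
```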
Removing an instance
Similarly, if two instances are under-used, it's simple to merge their data into a single instance without affecting the other instances. For example, if C and D are under-used, we can elect instance C to absorb the entire range and data of instance D.
| Database instance | Range | Required data reallocation |
| --- | --- | --- |
| Instance A1 | [000..124] | No |
| Instance A2 | [125..249] | No |
| Instance B | [250..499] | No |
| Instance C (merged) | [500..999] | Yes |
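The merge is the mirror image of the split, assuming the two ranges are adjacent as in the C/D example:

```python
def merge_ranges(ranges: dict[str, tuple[int, int]], keep: str, absorb: str) -> None:
    """Fold the absorbed instance's range into the kept one.

    Only the data living on the absorbed instance has to be moved;
    the remaining instances are not affected.
    """
    keep_low, keep_high = ranges[keep]
    absorb_low, absorb_high = ranges.pop(absorb)
    ranges[keep] = (min(keep_low, absorb_low), max(keep_high, absorb_high))

ranges = {
    "Instance A1": (0, 124),
    "Instance A2": (125, 249),
    "Instance B": (250, 499),
    "Instance C": (500, 749),
    "Instance D": (750, 999),
}
merge_ranges(ranges, "Instance C", "Instance D")
# Instance C now covers (500, 999); all other instances are untouched.
```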
Virtual nodes
Each shard is expected to have a different workload even if the hash splits the data equally. For example, some records might be accessed more than others, or the IDs fed into the hash function might produce biased hashes.
To help rebalance the workload without adding or removing instances, we can bind multiple ranges to a single instance. In this scenario, it is more appropriate to say the hash ranges are assigned to virtual nodes rather than to database instances.
| Instance | Virtual node | Range |
| --- | --- | --- |
| Instance A | Node 1 | [000..199] |
| Instance B | Node 2 | [200..399] |
| Instance C | Node 3 | [400..599] |
| Instance A | Node 4 | [600..799] |
| Instance B | Node 5 | [800..999] |
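A sketch of that extra level of indirection, using the node names and ranges from the table above: the hash is resolved to a virtual node first, and the node is then mapped to a physical instance, so rebalancing only requires repointing a node.

```python
import bisect

NODE_RANGES = [          # (upper bound of the range, virtual node)
    (199, "Node 1"),
    (399, "Node 2"),
    (599, "Node 3"),
    (799, "Node 4"),
    (999, "Node 5"),
]

# Moving a node between instances rebalances load without touching the ranges.
NODE_TO_INSTANCE = {
    "Node 1": "Instance A",
    "Node 2": "Instance B",
    "Node 3": "Instance C",
    "Node 4": "Instance A",
    "Node 5": "Instance B",
}

def instance_for(hash_value: int) -> str:
    bounds = [upper for upper, _ in NODE_RANGES]
    node = NODE_RANGES[bisect.bisect_left(bounds, hash_value)][1]
    return NODE_TO_INSTANCE[node]

print(instance_for(894))  # Node 5 -> Instance B
```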
Why consistent hashing is favored over linear hashing
Linear hashing is a strategy for distributing values over a fixed number of data buckets. Since each bucket is represented by a single value, it's a very intuitive and fast strategy.
A very common way of implementing linear hashing is using the modulo operation on the hash value, with the number of database instances as the divisor. This strategy also splits the data evenly between all the instances.
| Database instance | Maps to mod value |
| --- | --- |
| Instance A | 0 |
| Instance B | 1 |
| Instance C | 2 |
| Input | Hash value | Mod value (mod 3) | Database |
| --- | --- | --- | --- |
| "key_1" | 22 | 1 | Instance B |
| "key_2" | 894 | 0 | Instance A |
| "key_3" | 380 | 2 | Instance C |
| "key_4" | 643 | 1 | Instance B |
However, adding or removing instances changes the modulo divisor, so any ID can now produce a different mod value than before.
To ensure that every record is stored on the database instance corresponding to its new mod value, it is necessary to recalculate the mod value of every ID in the database. If the new value is different from the previous one, the record needs to be moved to another database instance.
| Input | Hash value | Old distribution (mod 3) | New distribution (mod 4) | Move data? |
| --- | --- | --- | --- | --- |
| "key_1" | 22 | 1 = Instance B | 2 = Instance C | Yes |
| "key_2" | 894 | 0 = Instance A | 2 = Instance C | Yes |
| "key_3" | 380 | 2 = Instance C | 0 = Instance A | Yes |
| "key_4" | 643 | 1 = Instance B | 3 = Instance D | Yes |
Considering a properly balanced database, changing the divisor requires moving at least 50% of the entire database and affects all the database instances, regardless of how much the divisor increased or decreased. In the worst cases, the amount of records that need to be moved approaches 100% of the entire database.
As exemplified in a scenario with three database instances:
| Current instance count | Change | Resulting instance count | % of records moved |
| --- | --- | --- | --- |
| 3 | Remove 2 instances | 1 | 66.6% |
| 3 | Remove 1 instance | 2 | 66.6% |
| 3 | Add 1 instance | 4 | 75.0% |
| 3 | Add 2 instances | 5 | 80.0% |
| 3 | Add 3 instances | 6 | 50.0% |
| 3 | Add 4 instances | 7 | 85.7% |
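The percentages above can be reproduced with a small brute-force check over the hash space; apart from rounding, this sketch prints the last column of the table.

```python
def fraction_moved(old_count: int, new_count: int, space: int = 2520) -> float:
    """Fraction of hash values whose modulo bucket changes when the
    divisor changes from old_count to new_count.

    2520 is divisible by every cycle length involved here, so the
    brute-force fractions are exact.
    """
    moved = sum(1 for h in range(space) if h % old_count != h % new_count)
    return moved / space

for new_count in (1, 2, 4, 5, 6, 7):
    print(f"3 -> {new_count} instances: {fraction_moved(3, new_count):.1%} moved")
```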
In a real-world scenario, if database instance A is overloaded, adding a new instance would require moving records across the entire database, not only instance A, which is not only inefficient but also makes database availability much harder to manage.
Projecting the outcome is also much harder. There's a possibility that, after the data migration, one instance ends up with most of the high-request records, potentially becoming even more overloaded than the previous hotspot.
A dynamic table mitigates most problems
Using a dynamic table makes it possible to rearrange only specific hash ranges. There is much more control over how many records need to be analyzed and relocated, and over how many database instances are affected at a time.
In the scenario where an instance is overloaded, even if the range to split is badly chosen and doesn't alleviate the hotspot, there is no risk of increasing the workload of that instance.