Partitioning

Partitioning

Partitioning and Replication are usually combined. When doing so, each node is a leader for some partitions and a follower for others. Partitions don't have to be replicated on every node so that node 1 could contain partitions A, B, C and node 2 partitions B, D, E, etc...

Partitions are usually a subset of a data set, it's still a database on its own.

How to dispatch the shards equally then so that some nodes don't have to deal with more queries than others and the load is then equally distributed?

Partitioning of key-value stores

By key range

Just like your encyclopedia.

With encyclopedia you search by key, same as in key-value stores.

  • Allows efficient range queries.
  • Load is not necessarily evenly distributed

For time series, such as sensors collection, prefix the timestamp key by the sensor's name.

By hash of key

Fairly common in distributed systems.

The key is non-cryptographically hashed (with MD5 or Fowler-Noll-Vo).

  • This ensures entries are evenly distributed (and so does load)
  • Does not permit range queries, and all nodes have to be queried when querying by primary key.

Partitioning with Secondary Indexes

Such as all cars whose colour is red. Index is color:red.

By document

Each partition maintains its own secondary index, covering only the documents in that partition.

You must then query all partitions, wait for them all to return in order to have a consistent response.

For the secondary index color,

  • Partition 0 could have ids 123, 245 and 657 with color:red, ids 23 and 345 with color:black
  • Partition 1 could have ids 745, 888 with color:red and no color:black.

considering Partition 0 keeps track of ids [0, 700] and Partition 1 ids [700, 1000].

This is called local indexing.

By term

The indexes are now held globally, not per partition. However, it is itself also partitioned.

Hence, color will be distributed so that colours from a to c would end up in Partition 0, and from d to m in Partition 1, etc. They would still keep track of document ids whatever the partition they are actually attached to. So an index in Partition 0 can contain a document which is in Partition 1.

This is called global indexing.

Rebalancing

When adding / removing / operating on nodes.

Make sure you have multiple shards / partitions on each node.

Request routing

Options:

* Connect to any node, if it's the one which has the requested data, serve it, otherwise forward to the other node which has it.
* Connect to a load balancer which only role is to dispatch
* Require from client to be aware of the partitions assignments to nodes

Many rely on separate coordination service such as Zookeeper:

Nodes self-register with Zookeeper and the routing tier or the client can subscribe to changes.

results matching ""

    No results matching ""