Distributed DataStores Basics

Scaling DBs & CAP Theorem:

CAP Theorem:  

  1. The CAP theorem applies to distributed datastores: multiple DB instances form a cluster, and requests/responses are served by the nodes of that cluster. 
  2. Distributed data stores follow the BASE properties: Basically Available, Soft state, Eventual consistency. 
  3. You can't have all 3 properties at once when a network partition occurs, so usually availability & partition tolerance are picked over consistency. 
  4. Such systems have 'Weak' or 'Eventual' consistency. It means a read after a write might be inconsistent for some time (seconds to minutes) until the data has propagated to all the nodes in the cluster. 
    1. These distributed or NoSQL DBs are used by YouTube and similar sites. However, they can't be used for banking and financial applications, which need strong consistency. 
  5. The theorem's three properties are: 
    1. Consistency: Data is the same across the cluster, so a read/write against any node of the cluster gives us the same data. 
    2. Availability: The system needs to stay highly available even if some nodes within the cluster go down. 
      1. If the master goes down, then read/write requests will go to a slave. 
    3. Partition tolerance: If the network connection between parts of the cluster fails, the system should still be up and running. 
      1. If the partition/connection between master and slave goes down, your most recent write might not be returned by a read, as the latest updates have not been replicated. 
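The stale-read behaviour described above can be sketched as a tiny simulation. All class and variable names here (`Cluster`, `Node`, `replication_delay`) are hypothetical, just to illustrate how async replication produces a window of eventual consistency:

```python
import time

class Node:
    def __init__(self):
        self.data = {}

class Cluster:
    """Toy cluster: writes hit the master, replication to the slave is delayed."""
    def __init__(self):
        self.master = Node()
        self.slave = Node()
        self.pending = []  # (apply_at, key, value) awaiting replication

    def write(self, key, value, replication_delay=0.05):
        # Write is acknowledged as soon as the master has it.
        self.master.data[key] = value
        self.pending.append((time.monotonic() + replication_delay, key, value))

    def _replicate(self):
        # Apply any replication events whose delay has elapsed.
        now = time.monotonic()
        remaining = []
        for apply_at, key, value in self.pending:
            if apply_at <= now:
                self.slave.data[key] = value
            else:
                remaining.append((apply_at, key, value))
        self.pending = remaining

    def read(self, key):
        # Reads are served by the slave, which may lag the master.
        self._replicate()
        return self.slave.data.get(key)

cluster = Cluster()
cluster.write("balance", 100)
stale = cluster.read("balance")   # read before replication completes: stale (None)
time.sleep(0.1)
fresh = cluster.read("balance")   # after the delay, the write is visible: 100
```

The first read misses the write entirely, which is exactly the "read your own write" anomaly the text warns about for banking workloads.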

Strategies to scale Relational DBs:

  1. These strategies are useful to scale an RDBMS: 
    1. Master-Slave: 
      1. Features: 
        1. One Master, a couple of Slaves.
        2. Writes go to the Master node.
        3. Reads come from the Slave nodes.
        4. Replication from Master to Slaves happens asynchronously. 
        5. Async replication results in short-term inconsistency: 
          1. Ex: A user writes to the Master and reads immediately. If the data has not yet been replicated to the Slave, the user cannot read his latest write and thus gets stale data.
    2. Sharding: 
      1. All DBs are Masters.
      2. Data is spread across different servers based on some key range, like A-I, J-Q, R-Z. That means data with keys starting with A-I goes to shard-1, J-Q goes to shard-2, and so on. 
      3. This increases capacity roughly n times by spreading the load across n shards. 
      4. Problems: 
        1. If most requests fall in A-I, then the first shard takes all the traffic, so we don't get good load balancing. To solve this, we need to split the hot range further, e.g. into A-F and G-I. 
          1. To do that you need to take the DB down, divide it, and switch it back in. That's painful, so range-based sharding is not a very flexible scaling mechanism. 
        2. Cross-shard SQL joins might be required to combine data from different shards, which is slow and complex. 
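The master-slave read/write split above can be sketched as a small router. The names (`ReadWriteRouter`, `"master-db"`, `"slave-1"`) are hypothetical; the point is only that writes always target the master while reads rotate across slaves:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the master; spread reads round-robin over slaves."""
    def __init__(self, master, slaves):
        self.master = master
        self._slave_cycle = itertools.cycle(slaves)

    def route(self, query):
        # Very rough classification: treat SELECTs as reads, everything else as writes.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._slave_cycle)
        return self.master

router = ReadWriteRouter("master-db", ["slave-1", "slave-2"])
w  = router.route("INSERT INTO users VALUES (1, 'amit')")  # -> master-db
r1 = router.route("SELECT * FROM users")                   # -> slave-1
r2 = router.route("SELECT * FROM users")                   # -> slave-2
```

Real proxies (e.g. ProxySQL) do this classification far more carefully, but the routing idea is the same.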
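Range-based sharding as described (A-I, J-Q, R-Z) can be sketched with a routing table keyed on the first letter. The class and shard names here are illustrative assumptions, not a real library API:

```python
import bisect

class RangeShardRouter:
    """Pick a shard by the first letter of the key, using fixed range boundaries."""
    def __init__(self, boundaries, shards):
        # boundaries are the last letters of each range except the final one,
        # e.g. ["I", "Q"] splits the alphabet into A-I, J-Q, R-Z.
        self.boundaries = boundaries
        self.shards = shards

    def shard_for(self, key):
        first = key[0].upper()
        idx = bisect.bisect_left(self.boundaries, first)
        return self.shards[idx]

router = RangeShardRouter(["I", "Q"], ["shard-1", "shard-2", "shard-3"])
a = router.shard_for("alice")  # A falls in A-I   -> shard-1
j = router.shard_for("john")   # J falls in J-Q   -> shard-2
r = router.shard_for("rita")   # R falls in R-Z   -> shard-3
```

Rebalancing a hot range (the A-F / G-I split from the text) means changing `boundaries` and migrating the affected rows, which is exactly the painful step the problem list calls out.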
