Distributed Datastore Basics

Basics:
  1. A distributed DB is divided into multiple nodes, and data is spread across those nodes. 
  2. There is no single master/slave config in a peer-to-peer distributed DB; every node is equal. 
  3. Every machine is given the responsibility of saving part of the data, based on a partitioning scheme.
  4. If some node goes down it's fine, since its data is also captured on some other node. 
  5. For a given piece of data, one node has the primary responsibility of storing it, and several secondary (backup) nodes store copies. Each node is therefore primary for its own range of data and secondary for parts of other nodes' ranges. In case of loss, data is retrieved from the secondary nodes.
  6. The consistent hashing technique is used to distribute the data across the nodes.
    1. The partition key of each record is hashed, and the hash value determines which node on the ring owns that record. 
  7. The datastore takes care of hashing the partition key and finding the responsible node for storing the record.
  8. NoSQL datastores like Cassandra have a replication factor, which is usually a small number. If the replication factor is 3, the data will be stored on 3 different nodes. 
    1. Say the backup strategy is to store the data on the next available nodes: the data goes to the computed node, then the next available node, then the next after that. 
    2. So it's a combination of consistent hashing and backing up the data onto different nodes. 
  9. To handle more load, extra nodes are added. Responsibility for the data is automatically redistributed across new and existing nodes, with membership changes spread via the Gossip protocol. 
    1. All nodes talk to each other, so every node knows about every other node. 
    2. A node that goes down stops sending and answering messages. When the other nodes get no response from it, they understand the node is down and automatically share its responsibility among themselves: another node becomes the primary for that node's range, say keys 10k-20k. 
  10. Similar strategies are implemented in Redis, HDFS, and Cassandra. 
    1. HDFS has a master node (name node) and data nodes; the data resides on the data nodes. 
    2. In such cases ZooKeeper can take care of reassigning responsibilities to slave nodes if the master goes down. 
  11. Commodity computers are ordinary, inexpensive machines, like the desktops we use at home; clusters scale by adding many of them rather than one specialized server. 
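The consistent hashing plus replication scheme described in points 6-8 can be sketched as a toy ring in Python. The node names, the choice of MD5 as the hash, and the "next distinct nodes clockwise" replica placement are illustrative assumptions; real systems like Cassandra add virtual nodes and pluggable partitioners on top of this idea.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: hashes a partition key to a point on the
    ring and walks clockwise to find the primary node plus its replicas."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node gets a position on the ring based on the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def nodes_for(self, partition_key):
        """Return the RF nodes responsible for this key: the first node at or
        after the key's hash position, then the next distinct nodes clockwise."""
        h = self._hash(partition_key)
        idx = bisect.bisect_left(self.ring, (h, ""))
        owners = []
        for i in range(len(self.ring)):
            node = self.ring[(idx + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.rf:
                break
        return owners

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("user:42"))  # primary plus 2 replicas: 3 distinct nodes
```

The same key always hashes to the same ring position, so reads and writes for a key consistently land on the same set of nodes, and adding a node only remaps the keys that fall in its slice of the ring.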
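The gossip-based failure detection from point 9 can be sketched as a small simulation. The `Node` class, heartbeat counters, and the miss threshold are hypothetical simplifications; real gossip protocols (e.g. Cassandra's) exchange state with a few random peers per round rather than the full mesh used here.

```python
class Node:
    """Toy gossip participant: tracks the freshest heartbeat it has seen
    from every node, and suspects nodes whose heartbeats have gone stale."""

    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}  # last heartbeat round seen, per node

    def tick(self, round_no):
        self.heartbeats[self.name] = round_no  # beat my own heart

    def gossip_with(self, other):
        # Merge views both ways: keep the freshest counter for every node.
        for node, beat in other.heartbeats.items():
            self.heartbeats[node] = max(self.heartbeats.get(node, -1), beat)
        for node, beat in self.heartbeats.items():
            other.heartbeats[node] = max(other.heartbeats.get(node, -1), beat)

    def suspected_down(self, round_no, threshold=3):
        return [n for n, beat in self.heartbeats.items()
                if round_no - beat > threshold]

names = ["n1", "n2", "n3"]
nodes = {n: Node(n) for n in names}
alive = set(names)

for round_no in range(1, 8):
    if round_no == 3:
        alive.discard("n3")          # n3 crashes and stops heartbeating
    for n in alive:
        nodes[n].tick(round_no)
    for a in alive:                  # every surviving node gossips
        for b in alive:
            if a != b:
                nodes[a].gossip_with(nodes[b])

print(nodes["n1"].suspected_down(round_no=7))  # → ['n3']
```

Once n1 marks n3 as down, a real cluster would also hand n3's key range to the next node on the ring, exactly as described above.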

    Why do you need to scale?
    Performance degrades: customers complain that requests are taking more time.

    Vertical Scaling: 
    Upgrade a single server's capability (CPU, RAM, disk) so that it can handle the volume of requests easily. 

    Horizontal Scaling:
    1. Add more machines to the cluster. 
    2. Cheaper than vertical scaling. 
    3. If the service is not heavily used, machines can be removed from the cluster. The service is not impacted: traffic is directed to the remaining machines, so requests are served by one machine or another at any given point.
    4. Moreover, machines can be placed in different continents. 
    Conclusion:
    1. Vertical scaling is cost efficient in the short term, as you buy less hardware. Horizontal scaling is cost efficient in the long run. 
    2. Vertical scaling is not fault tolerant, as everything runs on a single machine. 

    Advantages of distributed systems: 
    1. Fault tolerance:
      1. If a server in one country goes down, requests can be directed to a server in another country.
      2. User will see more latency but the service remains available. 
    2. Low Latency:
      1. With horizontal scaling you can place servers near where most requests originate; with vertical scaling there are only a few servers, so far-away users wait longer for a response. 
      2. The time taken to send and receive the request goes down; the only remaining time is the time taken by the service itself. 
