List of things to read/know for system design


How to approach it? 
  1. Ask questions on:
    1. Features
    2. How much to scale:
      1. How much data you need to store in the DBs
      2. What latency is expected
      3. How many requests to expect within a second/minute?
  2. Don't use buzzwords
  3. Clear and organized thinking: 
    1. Draw all the boxes
    2. Users
    3. APIs
  4. Drive discussions via 80-20 rule, where you should be talking 80% of the time 
Basics: Things to think about: 
  1. Features for MVP
  2. Define APIs 
  3. Think about System Availability
  4. Think about Latency Performance:
    1. For background jobs high latency is ok
    2. For customer facing latency needs to be low. 
  5. Scalability - For more users
  6. Durability - Data need to be stored in the DB without comprimising 
  7. Class Diagram
  8. Security & Privacy 
  9. Cost effective 
Concepts to know about:
  1. Vertical v/s Horizontal Scaling
    1. Add CPU on the Vertical. Adding more hosts is Horizontal. 
  2. CAP Theorem: 
    1. Consistency: Read will give you most recent write. 
    2. Availability: You will get the response back for sure. You might not get most recent writes. 
    3. Partition: Between 2 nodes, N/W packets could get dropped. 
    4. RDBMS: for Consistency
    5. No-SQL: for Availability 
  3. ACID v/s BASE
    1. RDBMS: ACID properties; Atomicity, Consistency, Isolation, Durability
    2. No-SQL: BASE properties: Basically available Soft state Eventual Consistency. 
  4. Partitioning v/s Sharding Data
    1. Sharding or spilting data based on some values. 
  5. Consistent Hashing
  6. Optimistic V/s Pessimistic locking
    1. Optimistic: Don't acquire locks till you are ready to commit the transaction
    2. Pessimistic: Acquire the lock from the start
  7. Strong V/s Eventual consistency
    1. Strong: Reads will always see the latest write. 
      1. RDBMS
    2. Eventual: Reads will sometimes see the latest write and eventually will see all the writes.
      1. No-SQL: Both: You can configure to get Strong or Eventual consistency. 
      2. Eventual helps with high availability and delayed consistency. 
  8. RDBMS V/s NoSQL
  9. Types of NoSQL
    1. Key Value: 
    2. Wide Column: 
    3. Document Based: XML or Json data
    4. Graph Based: 
  10. Caching
    1. Cache data is not shared between nodes. 
    2. Shared between nodes. 
    3. Cannot be source of truth. 
    4. Limited storage
  11. Datacenter/Hosts/Racks
    1. Datacenter has Racks and racks have hosts. 
    2. Latency to communicate between Racks and hosts.
  12. CPU/Memory/Hard drive/Network B/W
    1. Improve throughput, latency
  13. Random V/s Sequential write on the disk
    1. Sequential reads and writes are better. 
    2. Random are Slower
  14. Http vs Http2 vs Websockets
    1. Entire web run on Http
    2. Http2 
    3. Websockets: Bi-directional communication
  15. TCP/IP Model:
    1. 4 layers
  16. Ipv4 vs Ipv6
    1. Ipv4: 32 bit address
    2. Ipv6: 128 bit address
    3. How does IP routing works
  17. TCP vs UDP
    1. TCP: Connection oriented reliable connection. Sending documents. 
    2. UDP: Unreliable connection. Video Streaming. Its ok to lose some data packets in video streaming. Losing some data packets doesn't lose the consistency. 
  18. DNS Lookup. 
  19. Https v/s TLS(Transport Level Security)
    1. Secure communication between client and server in terms of data integrity and privacy. 
  20. Public key infrastructure & certificate authority
  21. Symmetric v/s Asymetric key
    1. Asymetric: Computational more expensive. Public-Private Key encryption. 
    2. Symmetric: AES
  22. Loadbalancer -> L4 V/s L7
    1. Load is distributed based on Round robin basis or average load on the node behind the LB.
    2. L4
    3. L7
  23. CDNs & Edge
    1. CDN: Content Delivery Network. Helps with:
      1. Lower latency
      2. Performance
    2. Edge 
  24. Bloom Filters & Count-min Sketch
  25. Paxos: Consensus over distributed hosts. 
  26. Design patterns & Object oriented design
  27. Virtual machines & containers
  28. Publisher-Subscriber over a Queue
    1. Customer Facing request shouldn't be exposed over pub-sub. 
  29. MapReduce
    1. Distributed and parallel processing of massive amounts of data. 
  30. Multithreading, concurrency, locks, synchronization, CAS
Actual implementation of some of these concepts:
  1. Cassandra: 
    1. Wide column highly scalable DB. use cases:
      1. Key-value storage
      2. Time Series data
      3. Storing more traditional rows with a lot of columns. 
    2. Can provide eventual and strong consistency. 
    3. Uses consistent hashing to shard the data. 
    4. Uses gossip protocol to make the nodes aware of other nodes status. 
  2. MongoDB/Couchbase:
    1. Json like structure can be persisted in it. 
    2. scale well.
  3. MySQL
    1. Traditional DBs. 
    2. Master-Slave architecture for scalability. 
  4. Memcached
    1. Distributed cache
    2. Simple fast key-value storage
  5. Redis
    1. Setup as cluster
    2. Flush data to hard-drive if you configure it that way. 
    3. All properties of Memcached in it. 
  6. ZooKeeper
    1. Central configuration management tool. 
    2. Used for leader election and distributed locking. 
    3. Scales well for reads not writes
    4. in-memory so can't store a lot of data. 
  7. Kafka
    1. Fault tolerance highly available queue used in pub-sub or streaming application. 
    2. deliver message exactly once. Keep the messages in proper order in the assigned partition. 
  8. NGINX/HAProxy
    1. Load balancers. 
    2. Manager 1000s of connection from clients with a single instance. 
  9. Solr, Elastic Search
    1. Search platform on top of 
    2. Highly available, scalable, fault tolerant search funtionality
    3. full search text
  10. BlobStore like Amazon S3(part of AWS platform)
    1. Stores big picture/files on cloud
  11. Docker -> Kubernets/Mesos(Software tools to manage containers)
    1. Software platform that provides the containers on DC or laptop or cloud. 
  12. Hadoop/Spark -> HDFS
    1. Hadoop: MapReduce
    2. Spark: Faster Hadoop as its in-memory mapreduce. 
    3. HDFS: Java based file system that is distributed and fault tolerant and Hadoop relies on it for all processing. 


No comments:

Post a Comment

NoSQL

This one is reviewed but I need to delete its copy from hubpages or somewhere NoSQL Data models: key-value  Aggregate model.  key or i...