How to approach it?
- Ask questions on:
- Features
- How much to scale:
- How much data you need to store in the DBs
- What latency is expected
- How many requests to expect within a second/minute?
- Don't use buzzwords
- Clear and organized thinking:
- Draw all the boxes
- Users
- APIs
- Drive discussions via 80-20 rule, where you should be talking 80% of the time
Basics: Things to think about:
- Features for MVP
- Define APIs
- Think about System Availability
- Think about Latency Performance:
- For background jobs high latency is ok
- For customer facing latency needs to be low.
- Scalability - For more users
- Durability - Data need to be stored in the DB without comprimising
- Class Diagram
- Security & Privacy
- Cost effective
Concepts to know about:
- Vertical v/s Horizontal Scaling
- Add CPU on the Vertical. Adding more hosts is Horizontal.
- CAP Theorem:
- Consistency: Read will give you most recent write.
- Availability: You will get the response back for sure. You might not get most recent writes.
- Partition: Between 2 nodes, N/W packets could get dropped.
- RDBMS: for Consistency
- No-SQL: for Availability
- ACID v/s BASE
- RDBMS: ACID properties; Atomicity, Consistency, Isolation, Durability
- No-SQL: BASE properties: Basically available Soft state Eventual Consistency.
- Partitioning v/s Sharding Data
- Sharding or spilting data based on some values.
- Consistent Hashing
- Optimistic V/s Pessimistic locking
- Optimistic: Don't acquire locks till you are ready to commit the transaction
- Pessimistic: Acquire the lock from the start
- Strong V/s Eventual consistency
- Strong: Reads will always see the latest write.
- RDBMS
- Eventual: Reads will sometimes see the latest write and eventually will see all the writes.
- No-SQL: Both: You can configure to get Strong or Eventual consistency.
- Eventual helps with high availability and delayed consistency.
- RDBMS V/s NoSQL
- Types of NoSQL
- Key Value:
- Wide Column:
- Document Based: XML or Json data
- Graph Based:
- Caching
- Cache data is not shared between nodes.
- Shared between nodes.
- Cannot be source of truth.
- Limited storage
- Datacenter/Hosts/Racks
- Datacenter has Racks and racks have hosts.
- Latency to communicate between Racks and hosts.
- CPU/Memory/Hard drive/Network B/W
- Improve throughput, latency
- Random V/s Sequential write on the disk
- Sequential reads and writes are better.
- Random are Slower
- Http vs Http2 vs Websockets
- Entire web run on Http
- Http2
- Websockets: Bi-directional communication
- TCP/IP Model:
- 4 layers
- Ipv4 vs Ipv6
- Ipv4: 32 bit address
- Ipv6: 128 bit address
- How does IP routing works
- TCP vs UDP
- TCP: Connection oriented reliable connection. Sending documents.
- UDP: Unreliable connection. Video Streaming. Its ok to lose some data packets in video streaming. Losing some data packets doesn't lose the consistency.
- DNS Lookup.
- Https v/s TLS(Transport Level Security)
- Secure communication between client and server in terms of data integrity and privacy.
- Public key infrastructure & certificate authority
- Symmetric v/s Asymetric key
- Asymetric: Computational more expensive. Public-Private Key encryption.
- Symmetric: AES
- Loadbalancer -> L4 V/s L7
- Load is distributed based on Round robin basis or average load on the node behind the LB.
- L4
- L7
- CDNs & Edge
- CDN: Content Delivery Network. Helps with:
- Lower latency
- Performance
- Edge
- Bloom Filters & Count-min Sketch
- Paxos: Consensus over distributed hosts.
- Design patterns & Object oriented design
- Virtual machines & containers
- Publisher-Subscriber over a Queue
- Customer Facing request shouldn't be exposed over pub-sub.
- MapReduce
- Distributed and parallel processing of massive amounts of data.
- Multithreading, concurrency, locks, synchronization, CAS
Actual implementation of some of these concepts:
- Cassandra:
- Wide column highly scalable DB. use cases:
- Key-value storage
- Time Series data
- Storing more traditional rows with a lot of columns.
- Can provide eventual and strong consistency.
- Uses consistent hashing to shard the data.
- Uses gossip protocol to make the nodes aware of other nodes status.
- MongoDB/Couchbase:
- Json like structure can be persisted in it.
- scale well.
- MySQL
- Traditional DBs.
- Master-Slave architecture for scalability.
- Memcached
- Distributed cache
- Simple fast key-value storage
- Redis
- Setup as cluster
- Flush data to hard-drive if you configure it that way.
- All properties of Memcached in it.
- ZooKeeper
- Central configuration management tool.
- Used for leader election and distributed locking.
- Scales well for reads not writes
- in-memory so can't store a lot of data.
- Kafka
- Fault tolerance highly available queue used in pub-sub or streaming application.
- deliver message exactly once. Keep the messages in proper order in the assigned partition.
- NGINX/HAProxy
- Load balancers.
- Manager 1000s of connection from clients with a single instance.
- Solr, Elastic Search
- Search platform on top of
- Highly available, scalable, fault tolerant search funtionality
- full search text
- BlobStore like Amazon S3(part of AWS platform)
- Stores big picture/files on cloud
- Docker -> Kubernets/Mesos(Software tools to manage containers)
- Software platform that provides the containers on DC or laptop or cloud.
- Hadoop/Spark -> HDFS
- Hadoop: MapReduce
- Spark: Faster Hadoop as its in-memory mapreduce.
- HDFS: Java based file system that is distributed and fault tolerant and Hadoop relies on it for all processing.