@tank/system-design

1.1.0

Practical system design for distributed systems. Scalability, load balancing, sharding, replication, caching, messaging (Kafka, CQRS, saga), reliability (circuit breakers, rate limiting), service architecture (microservices, API gateway), capacity planning and SLOs. Triggers: system design, distributed systems, scalability, load balancing, sharding, caching, message queue, Kafka, circuit breaker, rate limiting, microservices, API gateway, CAP theorem, CQRS, saga, SLO, capacity planning.


name: "@tank/system-design"
description: |
  Practical system design for production distributed systems. Covers scalability patterns (load balancing, horizontal scaling, CDN, auto-scaling), data layer design (database selection, replication, sharding, consistency models, CAP theorem applied), caching strategies (cache-aside, write-through, invalidation, stampede prevention), messaging and async patterns (queues vs streams, event-driven architecture, CQRS, saga pattern, delivery guarantees), reliability (circuit breakers, bulkheads, retries, rate limiting, timeouts, chaos engineering), service architecture (monolith vs microservices, API gateway, service mesh, distributed transactions), and capacity planning (back-of-envelope estimation, SLOs/SLIs, monitoring, distributed tracing).

Synthesizes Kleppmann (Designing Data-Intensive Applications), Vitillo (Understanding Distributed Systems), Newman (Building Microservices), Ford et al. (Software Architecture: The Hard Parts), Nygard (Release It!), Petrov (Database Internals), Richards & Ford (Fundamentals of Software Architecture), and Beyer et al. (Site Reliability Engineering).

Trigger phrases: "system design", "distributed systems", "scalability", "load balancing", "horizontal scaling", "vertical scaling", "database sharding", "database replication", "caching strategy", "cache invalidation", "message queue", "event-driven", "Kafka", "RabbitMQ", "pub/sub", "circuit breaker", "rate limiting", "bulkhead", "retry strategy", "microservices", "monolith", "API gateway", "service mesh", "CAP theorem", "eventual consistency", "strong consistency", "CQRS", "event sourcing", "saga pattern", "back-of-envelope", "SLO", "SLI", "capacity planning", "distributed tracing", "back pressure", "cache stampede", "thundering herd", "how should I scale", "which database", "when to use microservices"

Practical System Design

Core Philosophy

  1. Every decision is a trade-off. There are no best solutions, only context-appropriate ones. Articulate what you gain AND what you give up.
  2. Measure before designing. Base architectural decisions on observed load, latency, and failure data — not hypothetical future scale.
  3. Start with a monolith, earn microservices. Premature distribution adds complexity without proven benefit. Decompose when evidence demands it.
  4. Design for failure, not just success. Every network call can fail, every dependency can slow down. The question is how your system behaves when things go wrong.
  5. Understand your data. Read/write ratio, access patterns, consistency requirements, and growth rate drive most architectural choices.

Quick-Start: Common Problems

"How should I scale this?"

  1. Identify the bottleneck → Is it CPU, memory, I/O, network, or a downstream dependency?
  2. Can you scale vertically (bigger machine)? → Cheaper and simpler if it works
  3. Is the bottleneck stateless? → Horizontal scaling behind a load balancer
  4. Is the bottleneck the database? → Read replicas for read-heavy, sharding for write-heavy
  5. Is there a hot path? → Cache it (see Caching Strategy Selection below)

→ See references/scalability-patterns.md
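
Step 4 above (read replicas for read-heavy load, sharding for write-heavy load) can be sketched as follows. This is a minimal illustration, not a recommended production design: `shard_for` and `ReplicaRouter` are hypothetical names, the connection objects are plain strings, and hash-modulo placement is assumed only for simplicity.

```python
import hashlib
import itertools


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme: hash mod N)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ReplicaRouter:
    """Send writes to the primary and spread reads round-robin over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def for_write(self):
        return self.primary

    def for_read(self):
        return next(self._replicas)


router = ReplicaRouter("primary-db", ["replica-1", "replica-2"])
shard = shard_for("user:42", shard_count=4)  # same key always maps to the same shard
```

One trade-off to keep in mind: with hash-modulo placement, changing `shard_count` remaps most keys, which is why schemes such as consistent hashing are often preferred when shards are added regularly. Reads served by replicas may also lag behind the primary.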

"Which database should I use?"

  1. What are the access patterns? → Key-value lookups, complex queries, graph traversals, time-series?
  2. What consistency do you need? → Strong (financial), eventual (social feed), causal (collaboration)?
  3. What's the read/write ratio? → Read-heavy favors replicas + cache, write-heavy favors LSM-based stores
  4. Will you need joins across entities? → Relational. If not, document or key-value may fit.

→ See references/data-layer.md for the full selection matrix

"Should I use microservices?"

  1. Can separate teams deploy independently? → Key signal for decomposition
  2. Do components need different scaling profiles? → Another strong signal
  3. Is the domain well-understood? → If not, monolith first — wrong boundaries are expensive to fix
  4. Team smaller than ~50 engineers? → Modular monolith likely sufficient

→ See references/service-architecture.md

"My system keeps failing under load"

  1. Are there missing timeouts? → Add timeouts to every external call
  2. Is one failing dependency taking everything down? → Circuit breakers + bulkheads
  3. Are retries amplifying the problem? → Add jitter, set retry budgets
  4. Is the queue unbounded? → Add back-pressure, set queue depth limits

→ See references/reliability-patterns.md
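
As an illustration of step 2 above, here is a minimal circuit breaker sketch. It assumes a single-threaded caller and a simple consecutive-failure threshold; `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, and real libraries typically add per-call timeouts, metrics, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling onto a failing dependency.
                raise CircuitOpenError("dependency is failing; call rejected")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback. Using one breaker per dependency keeps a single failing service from tripping calls to healthy ones.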

Decision Trees

Architecture Style Selection

| Team Size | Domain Clarity | Deploy Independence Needed | Recommendation |
|---|---|---|---|
| < 20 engineers | Low | No | Monolith |
| < 20 engineers | High | Partial | Modular monolith |
| 20-50 engineers | High | Yes | Selective microservices |
| 50+ engineers | High | Yes | Microservices |
| Any | Uncertain | Any | Monolith first, decompose later |

Communication Pattern Selection

| Need | Pattern | Protocol |
|---|---|---|
| Synchronous request-response, low latency | Direct call | REST or gRPC |
| Fire-and-forget, decoupled | Message queue | RabbitMQ, SQS |
| Event broadcast to many consumers | Pub/sub stream | Kafka, SNS |
| Long-running workflow coordination | Orchestrated saga | Temporal, Step Functions |
| High-throughput data pipeline | Event stream | Kafka, Kinesis |

Caching Strategy Selection

| Scenario | Pattern | Why |
|---|---|---|
| Read-heavy, tolerant of stale data | Cache-aside with TTL | Simple, widely applicable |
| Reads must reflect recent writes | Write-through | Consistency at cost of write latency |
| Write-heavy, reads can lag | Write-behind | Fast writes, async persistence |
| Predictable access patterns | Refresh-ahead | Avoids cache miss latency |
| Highly dynamic, user-specific, security-sensitive | Never cache | Stale data cost > latency cost |
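
The first row, cache-aside with TTL, can be sketched as follows. This is an illustrative in-process version under stated assumptions: `CacheAside` and `load_from_db` are hypothetical names, a plain dict stands in for a real cache such as Redis, and there is no stampede protection or size limit.

```python
import time


class CacheAside:
    """Read path: check the cache first, fall back to the source of truth on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # hypothetical loader, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]            # cache hit, possibly stale within the TTL
        value = self._load(key)        # miss or expired: read the source of truth
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads from the database."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a read can be, which is why this pattern suits read-heavy data that tolerates staleness. The database remains the source of truth, so losing the cache costs only latency.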

Database Type Selection

| Access Pattern | Best Fit | Example |
|---|---|---|
| Structured data, complex queries, transactions | Relational (PostgreSQL, MySQL) | E-commerce orders, financial records |
| Flexible schema, document-oriented reads | Document (MongoDB) | Content management, user profiles |
| Simple key-value lookups, extreme throughput | Key-Value (Redis, DynamoDB) | Sessions, feature flags, counters |
| Wide rows, high write throughput | Wide-Column (Cassandra, ScyllaDB) | IoT telemetry, activity logs |
| Highly connected data, traversals | Graph (Neo4j) | Social networks, recommendations |
| Time-ordered data, aggregations | Time-Series (TimescaleDB, InfluxDB) | Metrics, monitoring, analytics |

Anti-Patterns Quick Reference

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Distributed monolith | Microservices that must deploy together | Enforce data ownership, async communication |
| Premature microservices | Complexity before clarity | Start monolith, decompose with evidence |
| Unbounded queues | OOM under load, cascading backlog | Set queue limits, implement back-pressure |
| Missing timeouts | One slow dependency blocks all threads | Timeout every external call, propagate deadlines |
| Retry storms | Retries amplify load on failing service | Exponential backoff + jitter + retry budgets |
| Cache as source of truth | Data loss when cache evicts | Cache is acceleration, database is truth |
| Shared mutable state | Contention, inconsistency, scaling wall | Externalize state, share-nothing design |
| Over-engineering for scale | Building for 10M users when you have 1K | Measure actual load, scale when needed |
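
The fix for retry storms (exponential backoff, jitter, and a retry budget) might look like the sketch below. All names (`RetryBudget`, `backoff_delay`, `call_with_retries`) and default values are assumptions chosen for illustration; the budget here is a simple lifetime counter, whereas a production budget would usually track a sliding time window.

```python
import random
import time


class RetryBudget:
    """Caps total retries at a fraction of first attempts (hypothetical helper)."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.attempts = 0
        self.retries = 0

    def record_attempt(self):
        self.attempts += 1

    def try_spend(self):
        allowed = max(self.min_retries, int(self.attempts * self.ratio))
        if self.retries >= allowed:
            return False  # budget exhausted: stop adding load
        self.retries += 1
        return True


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a random delay up to min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, budget, max_attempts=4, sleep=time.sleep, rng=random.random):
    budget.record_attempt()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # give up: out of attempts or out of budget
            sleep(backoff_delay(attempt, rng=rng))
```

Jitter spreads retries from many clients over time so they do not arrive in synchronized waves, and the shared budget limits how much extra load retries can add when a dependency is already failing. In practice only errors known to be transient should be retried; this sketch retries every exception for brevity.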

Reference Files

| File | Contents |
|---|---|
| references/scalability-patterns.md | Scaling dimensions (X/Y/Z axes), load balancing (L4 vs L7, algorithms), auto-scaling strategies, CDN architecture, stateless design, database scaling overview |
| references/data-layer.md | Database type selection matrix, replication strategies, partitioning/sharding, consistency models, CAP/PACELC applied, SQL vs NoSQL framework, data modeling for scale |
| references/caching-strategies.md | Cache placement, caching patterns (aside/through/behind/ahead), distributed cache architecture, invalidation, eviction policies, failure modes (stampede/penetration/avalanche) |
| references/messaging-and-async.md | Sync vs async decision, queue fundamentals, broker selection (Kafka/RabbitMQ/SQS), event-driven architecture, CQRS, saga pattern, back-pressure, delivery guarantees |
| references/reliability-patterns.md | Failure modes, circuit breakers, retry strategies, bulkheads, rate limiting, timeouts/deadlines, health checks, failover, chaos engineering |
| references/service-architecture.md | Architecture style selection, service decomposition, API gateway, communication patterns (REST/gRPC/GraphQL), service discovery, distributed transactions, service mesh |
| references/capacity-and-observability.md | Back-of-envelope estimation, latency reference numbers, SLOs/SLIs/SLAs, bottleneck analysis, monitoring strategy (RED/USE), distributed tracing, alerting design, capacity planning |
