Practical System Design

Name: @tank/system-design
Author: Elad Ben Haim

Core Philosophy

Every decision is a trade-off. There are no best solutions, only context-appropriate ones. Articulate what you gain AND what you give up.
Measure before designing. Base architectural decisions on observed load, latency, and failure data — not hypothetical future scale.
Start with a monolith, earn microservices. Premature distribution adds complexity without proven benefit. Decompose when evidence demands it.
Design for failure, not just success. Every network call can fail, every dependency can slow down. The question is how your system behaves when things go wrong.
Understand your data. Read/write ratio, access patterns, consistency requirements, and growth rate drive most architectural choices.

Quick-Start: Common Problems

"How should I scale this?"

Identify the bottleneck → Is it CPU, memory, I/O, network, or a downstream dependency?
Can you scale vertically (bigger machine)? → Cheaper and simpler if it works
Is the bottleneck stateless? → Horizontal scaling behind a load balancer
Is the bottleneck the database? → Read replicas for read-heavy, sharding for write-heavy
Is there a hot path? → Cache it (see Caching Strategy Selection below)
→ See references/scalability-patterns.md
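As a rough illustration of the "sharding for write-heavy" step, the sketch below routes a key to a shard with a stable hash. The function name `shard_for` and the key format are hypothetical and chosen only for this example. It assumes a fixed shard count.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # sha256 is stable across processes and machines, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user_id in ("user:1", "user:2", "user:3"):
        print(user_id, "->", shard_for(user_id, num_shards=8))
```

Modulo hashing remaps most keys when the shard count changes, so a system that expects to add shards often would likely want consistent hashing or a lookup table instead.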

"Which database should I use?"

What are the access patterns? → Key-value lookups, complex queries, graph traversals, time-series?
What consistency do you need? → Strong (financial), eventual (social feed), causal (collaboration)?
What's the read/write ratio? → Read-heavy favors replicas + cache, write-heavy favors LSM-based stores
Will you need joins across entities? → Relational. If not, document or key-value may fit.
→ See references/data-layer.md for the full selection matrix

"Should I use microservices?"

Can separate teams deploy independently? → Key signal for decomposition
Do components need different scaling profiles? → Another strong signal
Is the domain well-understood? → If not, monolith first — wrong boundaries are expensive to fix
Team smaller than ~50 engineers? → Modular monolith likely sufficient
→ See references/service-architecture.md

"My system keeps failing under load"

Are there missing timeouts? → Add timeouts to every external call
Is one failing dependency taking everything down? → Circuit breakers + bulkheads
Are retries amplifying the problem? → Add jitter, set retry budgets
Is the queue unbounded? → Add back-pressure, set queue depth limits
→ See references/reliability-patterns.md
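The circuit breaker mentioned above can be sketched in a few lines. This is a minimal, single-threaded illustration, not a production implementation. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Minimal breaker: closed -> open -> half-open (one trial) -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be unhealthy.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Reset timeout elapsed: fall through and allow a trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The wrapped function is still expected to enforce its own timeout. The breaker only stops new calls from piling onto a dependency that is already failing. A shared breaker used from multiple threads would also need a lock around its state.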

Decision Trees

Architecture Style Selection

Team Size	Domain Clarity	Deploy Independence Needed	Recommendation
< 20 engineers	Low	No	Monolith
< 20 engineers	High	Partial	Modular monolith
20-50 engineers	High	Yes	Selective microservices
50+ engineers	High	Yes	Microservices
Any	Uncertain	Any	Monolith first, decompose later

Communication Pattern Selection

Need	Pattern	Protocol / Technology
Synchronous request-response, low latency	Direct call	REST or gRPC
Fire-and-forget, decoupled	Message queue	RabbitMQ, SQS
Event broadcast to many consumers	Pub/sub stream	Kafka, SNS
Long-running workflow coordination	Orchestrated saga	Temporal, Step Functions
High-throughput data pipeline	Event stream	Kafka, Kinesis
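For the fire-and-forget row, a small in-process stand-in can show how a queue decouples producer from consumer while keeping its depth bounded. The sketch uses Python's standard `queue.Queue` in place of a real broker such as RabbitMQ or SQS. The `BoundedInbox` class and its method names are hypothetical.

```python
import queue


class BoundedInbox:
    """Fire-and-forget job queue with a hard depth limit."""

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)
        self.rejected = 0  # count of jobs refused because the queue was full

    def submit(self, job):
        """Return True if accepted, False if full (the caller should back off)."""
        try:
            self._q.put_nowait(job)
            return True
        except queue.Full:
            # Rejecting is the back-pressure signal: the producer learns
            # immediately that the consumer is behind.
            self.rejected += 1
            return False

    def take(self):
        """Return the next job in FIFO order, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


if __name__ == "__main__":
    inbox = BoundedInbox(max_depth=2)
    print([inbox.submit(j) for j in ("a", "b", "c")])  # third submit is refused
    print(inbox.take())
```

Refusing work at the edge keeps memory bounded. Whether the producer then retries later, drops the job, or returns an error to its own caller depends on the workload.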

Caching Strategy Selection

Scenario	Pattern	Why
Read-heavy, tolerant of stale data	Cache-aside with TTL	Simple, widely applicable
Reads must reflect recent writes	Write-through	Consistency at cost of write latency
Write-heavy, reads can lag	Write-behind	Fast writes, async persistence
Predictable access patterns	Refresh-ahead	Avoids cache miss latency
Highly dynamic, user-specific, security-sensitive	Never cache	Stale data cost > latency cost
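The cache-aside with TTL row can be sketched as follows. This is an in-process illustration under stated assumptions: `TTLCache`, `cache_aside_get`, `cache_aside_update`, and the `load_from_db` / `write_to_db` callables are hypothetical names, and a real deployment would more likely use a shared cache such as Redis.

```python
import time


class TTLCache:
    """In-process cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so tests can control time
        self._entries = {}      # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        self._entries.pop(key, None)


def cache_aside_get(cache, key, load_from_db):
    """Read path: check the cache, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)  # the database remains the source of truth
        cache.set(key, value)
    return value


def cache_aside_update(cache, key, value, write_to_db):
    """Write path: update the database, then invalidate the cached copy."""
    write_to_db(key, value)
    cache.delete(key)
```

The sketch treats `None` as a miss, so it cannot cache a legitimately absent value. It also does nothing to prevent many concurrent misses from hitting the database at once.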

Database Type Selection

Access Pattern	Best Fit	Example
Structured data, complex queries, transactions	Relational (PostgreSQL, MySQL)	E-commerce orders, financial records
Flexible schema, document-oriented reads	Document (MongoDB)	Content management, user profiles
Simple key-value lookups, extreme throughput	Key-Value (Redis, DynamoDB)	Sessions, feature flags, counters
Wide rows, high write throughput	Wide-Column (Cassandra, ScyllaDB)	IoT telemetry, activity logs
Highly connected data, traversals	Graph (Neo4j)	Social networks, recommendations
Time-ordered data, aggregations	Time-Series (TimescaleDB, InfluxDB)	Metrics, monitoring, analytics

Anti-Patterns Quick Reference

Anti-Pattern	Problem	Fix
Distributed monolith	Microservices that must deploy together	Enforce data ownership, async communication
Premature microservices	Complexity before clarity	Start monolith, decompose with evidence
Unbounded queues	OOM under load, cascading backlog	Set queue limits, implement back-pressure
Missing timeouts	One slow dependency blocks all threads	Timeout every external call, propagate deadlines
Retry storms	Retries amplify load on failing service	Exponential backoff + jitter + retry budgets
Cache as source of truth	Data loss when cache evicts	Cache is acceleration, database is truth
Shared mutable state	Contention, inconsistency, scaling wall	Externalize state, share-nothing design
Over-engineering for scale	Building for 10M users when you have 1K	Measure actual load, scale when needed
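The retry-storm fix in the table (exponential backoff, jitter, retry budgets) might look like the sketch below. The names `RetryBudget` and `call_with_retries` and all default values are assumptions for illustration. The budget here is a simple fixed counter, whereas real systems often express it as a ratio of retries to requests.

```python
import random
import time


class RetryBudget:
    """Caps total retries across callers so a failing dependency is not flooded."""

    def __init__(self, max_retries):
        self.remaining = max_retries

    def try_spend(self):
        """Consume one retry if any remain; return whether a retry is allowed."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_retries(fn, budget, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # out of attempts or out of budget: surface the error
            # Backoff ceiling doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random duration in [0, cap) so clients
            # that failed together do not retry together.
            sleep(rng() * cap)
```

For brevity the sketch retries on any exception. In practice only errors known to be transient (timeouts, connection resets, throttling responses) are worth retrying, and only for operations that are safe to repeat.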

Reference Files

File	Contents
`references/scalability-patterns.md`	Scaling dimensions (X/Y/Z axes), load balancing (L4 vs L7, algorithms), auto-scaling strategies, CDN architecture, stateless design, database scaling overview
`references/data-layer.md`	Database type selection matrix, replication strategies, partitioning/sharding, consistency models, CAP/PACELC applied, SQL vs NoSQL framework, data modeling for scale
`references/caching-strategies.md`	Cache placement, caching patterns (aside/through/behind/ahead), distributed cache architecture, invalidation, eviction policies, failure modes (stampede/penetration/avalanche)
`references/messaging-and-async.md`	Sync vs async decision, queue fundamentals, broker selection (Kafka/RabbitMQ/SQS), event-driven architecture, CQRS, saga pattern, back-pressure, delivery guarantees
`references/reliability-patterns.md`	Failure modes, circuit breakers, retry strategies, bulkheads, rate limiting, timeouts/deadlines, health checks, failover, chaos engineering
`references/service-architecture.md`	Architecture style selection, service decomposition, API gateway, communication patterns (REST/gRPC/GraphQL), service discovery, distributed transactions, service mesh
`references/capacity-and-observability.md`	Back-of-envelope estimation, latency reference numbers, SLOs/SLIs/SLAs, bottleneck analysis, monitoring strategy (RED/USE), distributed tracing, alerting design, capacity planning
