June 11, 2026

System Design Fundamentals for Scalable Backend Services

A good system design is not a perfect diagram. It is a set of tradeoffs that fits the requirements, constraints, and failure cases of the problem in front of you. The practical goal is to break a large problem into smaller modules, explain how those modules work together, and identify where the system will bottleneck as traffic, data size, or reliability requirements increase.

For a junior developer, the most useful way to approach system design is top-down. Start with what the system must do, ask what constraints matter, sketch the major components, then improve the design where the bottlenecks appear. This article walks through the building blocks that show up again and again in back-end systems: load balancers, databases, replication, caches, content delivery, sharding, indexes, proxies, queues, consistent hashing, and client-server communication protocols.

The Problem

Imagine you are designing a back-end service that accepts user requests, reads and writes data, serves static media, and runs slow work in the background. At small scale, one server and one database might be enough. At larger scale, the same design starts to fail in predictable ways:

  • One server cannot handle all incoming requests.
  • One database becomes slow because every read and write goes to the same place.
  • Slow work blocks user-facing requests.
  • Static files waste application server capacity.
  • A single machine failure takes the whole product down.
  • Real-time features create too many repeated client requests.

A scalable design separates these concerns into components. Each component solves one class of problem, but also introduces a tradeoff.

Client
  |
  v
Load Balancer
  |
  v
Web Servers
  |
  v
Load Balancer
  |
  v
Application Servers / Cache Layer
  |
  +--> Queue --> Workers
  |
  +--> Cache
  |
  v
Load Balancer
  |
  v
Databases / Replicas / Shards

This diagram is not the final design for every system. It is a practical starting point. The design becomes useful when you can explain why each box exists, how requests flow through it, and what happens when a box is slow or unavailable.

Start by Scoping the System

Before choosing technologies, clarify the scope. In a design interview or an engineering design review, this step matters more than naming tools.

Ask questions like:

  1. Who are the users or clients of the system?
  2. What are the main actions: reads, writes, uploads, searches, notifications, real-time updates?
  3. Which data must be strongly correct, and which data can be slightly stale?
  4. What is the expected traffic pattern?
  5. Are reads more common than writes?
  6. What happens when a dependency fails?
  7. Which operations must return immediately, and which can run asynchronously?

Then identify actors, inputs, outputs, and constraints.

Area        | Practical question
Actors      | Who sends requests and who consumes the results?
Inputs      | What data enters the system, and how large is it?
Outputs     | What response, event, file, or database change should happen?
Constraints | What matters most: latency, consistency, availability, cost, or simplicity?
Bottlenecks | Which component is most likely to become slow first?

This top-down approach keeps the design focused. It also prevents a common mistake: adding advanced distributed system patterns before the basic requirements are clear.

Load Balancing: Spread Requests Across Workers

Load balancing distributes requests across multiple machines. It improves scalability because more machines can serve traffic, and it improves redundancy because one machine can fail without stopping the entire service.

Common distribution strategies include:

  • Random: choose a back-end randomly.
  • Round-robin: cycle through back-ends in order.
  • Weighted random: choose a back-end using weights, such as available memory or CPU capacity.
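
The three strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a production balancer: the host names and the function names are hypothetical, and the weights are assumed to be precomputed relative capacities.

```python
import itertools
import random

def pick_random(backends, rng=random):
    """Random: every back-end is equally likely."""
    return rng.choice(backends)

def round_robin(backends):
    """Round-robin: an iterator that cycles through back-ends in order."""
    return itertools.cycle(backends)

def pick_weighted(weights, rng=random):
    """Weighted random: `weights` maps back-end name -> relative capacity."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

backends = ["app-a", "app-b", "app-c"]  # hypothetical host names
rr = round_robin(backends)
first_four = [next(rr) for _ in range(4)]  # wraps back to app-a on the 4th pick
```

Round-robin needs a little state (the position in the cycle), while the random strategies are stateless, which is one reason they are easy to run on many balancers at once.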

In a larger system, load balancing can appear at multiple layers:

  1. Between users and web servers.
  2. Between web servers and application or cache servers.
  3. Between internal services and databases.

Client
  |
  v
LB 1
  |
  +--> Web Server A
  +--> Web Server B
          |
          v
        LB 2
          |
          +--> App Server A
          +--> App Server B
                  |
                  v
                LB 3
                  |
                  +--> Database A
                  +--> Database B

Smart Clients

A smart client keeps a pool of service hosts and balances requests itself. It can detect hosts that are not responding, use recovered hosts again, and include newly added hosts. This can be attractive in small systems because developers do not need a separate load balancer for every internal service.

The tradeoff is complexity. As the system grows, putting load balancing behavior inside many clients becomes harder to manage consistently.
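
A smart client might look roughly like the sketch below. The class name, the method names, and the choice to treat ConnectionError as "host not responding" are all assumptions made for illustration.

```python
import itertools

class SmartClient:
    """Keeps a pool of hosts and balances requests across the healthy ones."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.down = set()
        self._counter = itertools.count()

    def add_host(self, host):
        self.hosts.append(host)      # newly added hosts join the rotation

    def mark_down(self, host):
        self.down.add(host)

    def mark_up(self, host):
        self.down.discard(host)      # recovered hosts are used again

    def next_host(self):
        healthy = [h for h in self.hosts if h not in self.down]
        if not healthy:
            raise RuntimeError("no healthy hosts")
        return healthy[next(self._counter) % len(healthy)]

    def call(self, send):
        """Try hosts in turn; `send(host)` is a caller-supplied function."""
        for _ in range(len(self.hosts)):
            host = self.next_host()
            try:
                return send(host)
            except ConnectionError:
                self.mark_down(host)  # stop routing to a host that failed
        raise RuntimeError("all hosts failed")
```

Every client that embeds this logic must agree on health rules and host lists, which is the consistency problem described above.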

Hardware Load Balancers

Hardware load balancers can provide high performance, but they are expensive and not always easy to configure. Large systems may use hardware load balancers as the first contact point for external traffic, while using smart clients or software load balancers inside the private network.

Software Load Balancers

Software load balancers avoid the cost of dedicated hardware and avoid forcing every developer to build smart client logic. A common pattern is to run a software proxy either on the client machine or on an intermediate server.

For example, a local proxy can expose a local port such as localhost:9000, then route requests to healthy back-end machines. An intermediate proxy can sit between clients and server-side components. In both cases, the proxy can manage health checks, add or remove machines, and balance requests across pools.

Storage Choice: Relational or Non-Relational

Storage design is one of the biggest system design decisions. The right choice depends on structure, scale, consistency needs, and how quickly the data model changes.

Relational databases store structured data with a predefined schema. Data is organized into rows and columns, where a row represents one entity and columns represent separate data points. Examples include MySQL, Oracle, SQL Server, SQLite, Postgres, and MariaDB.

Non-relational databases are commonly used when data is less structured, distributed across many machines, or changes shape frequently. They often support dynamic schemas and different storage models.

Storage model | How to think about it | Examples from common system design discussions
Key-value store | Data is stored as key-value pairs. The key points to the value. | Redis, Voldemort, Dynamo
Document database | Data is stored as documents inside collections. Documents in the same collection do not all need the same fields. | CouchDB, MongoDB
Wide-column database | Data is grouped into column families and is useful when analyzing large datasets by columns. | Cassandra, HBase
Graph database | Data is represented as nodes and relationships, useful when connections are central to the model. | Neo4j, InfiniteGraph

When a Relational Database Fits

A relational database is usually a strong fit when the system needs ACID compliance. ACID means the database protects correctness around transactions: atomicity, consistency, isolation, and durability. This reduces anomalies and protects database integrity.
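
Atomicity is the easiest of the four properties to see in code. The sketch below uses Python's built-in sqlite3 module; the accounts table, the account names, and the transfer helper are hypothetical. The credit is applied first on purpose, so that a failed debit has something to roll back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected a negative balance

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 20)")
conn.commit()

ok = transfer(conn, "alice", "bob", 500)  # alice cannot cover this amount
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The failed transfer leaves both balances as they were, even though the credit to bob had already executed inside the transaction.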

Financial systems and many commerce systems often need this kind of protection. If data is structured and not changing rapidly, and the business does not need to scale across many servers immediately, a relational database can be the simpler and safer first choice.

When a Non-Relational Database Fits

A non-relational database can fit when the database is the remaining bottleneck, that is, when querying and searching are slow even though the rest of the system is fast. It is also useful for large volumes of data with little or no structure, because the store does not impose a fixed shape on the data.

Non-relational databases are also useful when the system needs to spread data across multiple servers or data centers, which can save cost when using commodity hardware or cloud storage. They are also convenient during rapid development: frequent schema changes slow teams down when every change must be reflected in a fixed relational schema, and a dynamic schema avoids that step.

The tradeoff is that scalability and flexibility can come with weaker guarantees or more work in application code. The storage choice should follow the consistency and access requirements, not trends.

CAP Theorem: Know Which Guarantee You Are Trading Away

Distributed storage systems are affected by the CAP theorem. The three properties are:

  • Consistency: all nodes see the same data at the same time.
  • Availability: every request to a working node receives a response, though not necessarily one that reflects the latest write.
  • Partition tolerance: the system continues operating despite message loss or partial network failure.

A distributed data store cannot guarantee all three at the same time. The hard case is a network partition. To stay consistent, all nodes need to see the same updates in the same order. During a partition, an update in one partition may not reach another partition. A client could read from an up-to-date partition, then later read stale data from an out-of-date partition.

One solution is to stop serving requests from the out-of-date partition. That protects consistency, but the service is no longer fully available. Another solution is to keep serving requests, accepting the risk that some reads may be stale. That protects availability, but weakens consistency.

This tradeoff should be explicit. Do not just say the system is reliable. Say what happens during a partition.
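
The two choices can be made concrete with a deliberately tiny model. This is a toy with two in-memory replicas, not how any real database implements replication; the class names and the "CP" and "AP" mode labels are assumptions for illustration.

```python
class Replica:
    def __init__(self):
        self.value = None

class TinyStore:
    """Two replicas. `mode` is "CP" (refuse reads that may be stale) or "AP" (serve them)."""

    def __init__(self, mode):
        self.mode = mode
        self.primary = Replica()
        self.secondary = Replica()
        self.partitioned = False

    def write(self, value):
        self.primary.value = value
        if not self.partitioned:
            self.secondary.value = value  # replication only works while connected

    def read_from_secondary(self):
        if self.partitioned and self.mode == "CP":
            # The secondary cannot confirm it is current, so it refuses to answer.
            raise RuntimeError("unavailable during partition")
        return self.secondary.value  # in AP mode this may be stale

def run(mode):
    store = TinyStore(mode)
    store.write("v1")
    store.partitioned = True
    store.write("v2")  # reaches the primary only
    try:
        return store.read_from_secondary()
    except RuntimeError as exc:
        return str(exc)

ap_result = run("AP")
cp_result = run("CP")
```

The AP run returns the stale value, and the CP run returns an error. Neither is wrong; the design has to say which one the business operation can tolerate.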

Redundancy and Replication: Remove Single Points of Failure

Redundancy means duplicating critical services or data so one failure does not take everything down. Replication means keeping multiple copies of data across servers or databases.

For critical services and data, run multiple copies or versions at the same time on different machines. This gives the system a path for failover and provides backups during a crisis.

Primary Server
  |
  | failover
  v
Secondary Server

Active Data
  |
  | replication
  v
Mirrored Data
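
The failover arrow in the diagram can be sketched as a loop that tries servers in priority order. The function name is hypothetical, and treating ConnectionError as the failure signal is an assumption; a real system would also need health checks and a rule for when to fail back.

```python
def call_with_failover(servers, request):
    """Try each server in order; `servers` is a list of callables."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next copy
    raise RuntimeError("all servers failed") from last_error

def primary(request):
    raise ConnectionError("primary is down")  # simulated outage

def secondary(request):
    return f"secondary handled {request}"

response = call_with_failover([primary, secondary], "GET /profile")
```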

A useful service design is shared-nothing architecture. In this style, every node is independent and no central service manages all state. This reduces single points of failure, makes the system more resilient, and allows new servers to be added with fewer special conditions.

Replication is not free. More copies can mean more coordination, more storage, and more consistency decisions. Still, for important data and services, having only one copy is usually the larger risk.

Caching: Save Work on Hot Paths

Caching stores frequently used data closer to the request path. It relies on locality of reference: recently or frequently accessed data is likely to be accessed again.

Caching is used in almost every layer of computing. In back-end systems, it commonly appears in application servers, global cache layers, distributed caches, and content delivery networks.

Application Server Cache

An application server cache places cached responses directly on a request-layer node. The cache can live in memory, which is very fast, or on the node's local disk, which is still often faster than going to remote storage.

Request
  |
  v
Request Layer Node
  |
  +--> local cache hit -> return cached response
  |
  +--> cache miss -> fetch from origin -> cache response -> return response
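
The hit and miss paths in the diagram can be sketched with an in-memory dictionary. The class name and the fetch_from_origin callback are hypothetical, and the sketch leaves out expiry and size limits.

```python
class LocalCache:
    """In-memory cache on one request-layer node."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1             # local cache hit: return cached response
            return self._store[key]
        self.misses += 1
        value = self._fetch(key)       # cache miss: go to the origin
        self._store[key] = value       # cache the response for next time
        return value

origin_calls = []

def fetch_from_origin(key):
    origin_calls.append(key)
    return f"response for {key}"

cache = LocalCache(fetch_from_origin)
cache.get("/home")
cache.get("/home")  # served from memory, the origin is not called again
```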

This approach weakens when a load balancer distributes requests randomly across many nodes. Each node has its own cache, so the same request may land on different nodes and cause more cache misses. Two common ways to handle this are global caches and distributed caches.

Global Cache

A global cache provides one cache space for all request nodes. It can sit between application servers and the database. This is useful when there is a fixed dataset that needs to be cached or when special hardware provides fast input and output.

There are two common forms:

  1. The cache manages fetching data from the database.
  2. The application manages cache behavior because it understands eviction decisions better than the cache alone.

A global cache can simplify reads, but it can also become difficult to manage as the number of clients and requests grows.

Distributed Cache

A distributed cache divides cached data across multiple nodes, often using consistent hashing. Adding more cache nodes increases cache space.

Query key
  |
  v
Hash function
  |
  +--> Cache Node 1
  +--> Cache Node 2
  +--> Cache Node 3

The advantage is scalability. The disadvantage is complexity. If a node disappears, its cached entries are lost; requests can still pull data from the origin, and storing copies of the data on multiple nodes reduces the impact, but that replication makes the design more complicated.

Cache Invalidation

Cached data must stay coherent with the database. When the database changes, the old cached value should be invalidated or replaced. Three common write patterns are useful to compare.

Pattern | How it works | Strength | Risk
Write-through | Write to cache and database at the same time. | Strong consistency between cache and database. | Higher write latency because two writes happen.
Write-around | Write directly to the database and skip the cache. | Avoids filling cache with data that may not be read. | First read after a write misses the cache.
Write-back | Write to cache first, then flush to database later. | Low write latency and high throughput for write-heavy systems. | Data loss risk if the cache is the only copy before flush.
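
The three write patterns differ only in the order and timing of two writes. A minimal sketch, using plain dictionaries as stand-ins for the cache and the database (the class names are hypothetical):

```python
class Store:
    def __init__(self):
        self.cache = {}
        self.db = {}

class WriteThrough(Store):
    def write(self, key, value):
        self.db[key] = value       # both writes happen before returning
        self.cache[key] = value

class WriteAround(Store):
    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # skip the cache, drop any stale copy

class WriteBack(Store):
    def __init__(self):
        super().__init__()
        self.dirty = set()         # keys written to cache but not yet to the db

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()
```

In the write-back class, anything still in dirty when the cache process dies is lost, which is the risk named in the table.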

Cache eviction policies decide what to remove when the cache is full. Common policies include FIFO, LIFO or FILO, LRU, MRU, LFU, and random replacement.
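
As one example of an eviction policy, LRU (least recently used) can be sketched with the standard library's OrderedDict. The class name and the capacity value are illustrative assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry when capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" is now more recent than "b"
lru.put("c", 3)  # over capacity: "b" is evicted
```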

CDN for Static Media

A content delivery network is a cache layer for static media such as images, scripts, and other files. A request can be served from the CDN if the content is available. If not, the CDN can fetch it from the back-end and store it locally.

If the system is not large enough for its own CDN setup, serve static media from a separate subdomain such as static.example-service.com using lightweight web servers. This makes it easier to move the static subdomain behind a CDN later by changing DNS.

Sharding and Data Partitioning

Data partitioning splits a database or table across multiple machines. The goal is better manageability, performance, availability, and load balancing. After a certain scale, horizontal scaling by adding more servers can be cheaper and more feasible than making one server bigger.

Horizontal Partitioning

Horizontal partitioning places different rows into different tables or servers. A range-based strategy might store locations by five-digit ZIP code range:

Table A: ZIP values below 50000
Table B: ZIP values 50000 and above

The risk is uneven data distribution. If the range is chosen poorly, one table can receive much more data or traffic than another.
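
Range-based routing reduces to a search over sorted split points. In this sketch the single split point of 50000, the table names, and the function name are illustrative assumptions; adding more boundaries adds more ranges.

```python
import bisect

BOUNDARIES = [50000]             # ascending split points (assumed for illustration)
TABLES = ["table_a", "table_b"]  # always one more table than boundaries

def table_for_zip(zip_code):
    """Return the table that owns a ZIP code given as a string such as "02139"."""
    return TABLES[bisect.bisect_right(BOUNDARIES, int(zip_code))]
```

Counting how many rows land in each table for real data is a quick way to see whether the chosen ranges are skewed.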

Vertical Partitioning

Vertical partitioning splits data by feature. For a media-sharing product, one database server might store user information, another might store follower relationships, and another might store photos.

This is straightforward and can have a low impact on the application. But if one feature grows heavily, that feature-specific database may need its own partitioning later. For example, one server may not handle metadata queries for a very large number of photos and users.

Directory-Based Partitioning

Directory-based partitioning uses a lookup service that maps a key to a database server. This abstracts the partitioning scheme away from database access code.

Tuple key
  |
  v
Lookup service
  |
  v
Database server for that key

This makes it easier to add database servers or change the partitioning scheme. The tradeoff is that the lookup service is complex and can become a single point of failure if it is not designed carefully.
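
At its core, the lookup service is a mapping that callers consult before touching a database. A minimal in-memory sketch follows; the class, key, and server names are hypothetical, and a real directory would need persistence and redundancy.

```python
class PartitionDirectory:
    """Maps each key to the database server that currently holds it."""

    def __init__(self):
        self._owner = {}

    def assign(self, key, server):
        self._owner[key] = server

    def lookup(self, key):
        return self._owner[key]

directory = PartitionDirectory()
directory.assign("user:42", "db-1")
before_move = directory.lookup("user:42")

# Moving the data to a new server only changes the directory entry;
# code that calls lookup() does not need to change.
directory.assign("user:42", "db-2")
after_move = directory.lookup("user:42")
```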

Partitioning Criteria

Common partitioning criteria include:

  • Key or hash-based partitioning: hash a key attribute to choose a partition.
  • List partitioning: assign each partition a list of values.
  • Round-robin partitioning: with n partitions, assign the ith item to partition i mod n.
  • Composite partitioning: combine strategies, such as hashing plus list partitioning.

A simple hash strategy such as hash(key) mod n effectively fixes the total number of servers or partitions. If a new server is added, n changes, so most keys map to a different partition, which forces data redistribution and possibly downtime. Consistent hashing helps reduce that disruption.
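
The cost of changing the partition count is easy to measure. The sketch below hashes some made-up keys with SHA-256 (chosen only because it is stable across runs, unlike Python's built-in hash for strings) and counts how many change partition when going from 4 to 5 partitions.

```python
import hashlib

def stable_hash(key):
    """Deterministic integer hash of a string key."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

def partition_for(key, n):
    return stable_hash(key) % n

keys = [f"user-{i}" for i in range(10_000)]  # hypothetical keys
moved = sum(partition_for(k, 4) != partition_for(k, 5) for k in keys)
fraction_moved = moved / len(keys)
```

A key keeps its partition only when its hash has the same remainder modulo 4 and modulo 5, which holds for 4 of every 20 hash values, so roughly 80% of keys move.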

Common Sharding Problems

Sharding makes some database operations harder.

Joins are simple when related tables live on one server. They are harder and less efficient across shards because data must be compiled from multiple servers. A common workaround is denormalization, where data is duplicated so queries that previously needed joins can be served from one table. The risk is data inconsistency.

Referential integrity is also harder. Foreign keys across sharded databases are difficult, and many relational systems do not support them cleanly across shards. If the application requires referential integrity across shards, it may need to enforce it in application code and run cleanup jobs for dangling references.

Rebalancing is another common challenge. You may need to change a sharding scheme because of non-uniform data distribution or non-uniform request load. Adding a new database and rebalancing can require data movement and downtime unless the system has been designed to support it.

Indexes: Faster Reads with Write and Storage Costs

Indexes improve retrieval speed. An index can be created using one or more columns and can support rapid random lookups or efficient access to ordered records.

A practical way to think about an index is:

Indexed column value -> pointer to full row

Indexes create different views of the same data. They are useful for filtering and sorting large datasets without creating full additional copies of the data. They are especially helpful when data is spread across multiple physical devices and the system needs a way to find the correct physical location.

The tradeoff is write cost. On every write, the system must write the data and update the index. Indexes also increase storage overhead. Use indexes where they support real access patterns, not everywhere.
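
Both the lookup benefit and the write cost show up in a small in-memory sketch. The rows and function names are hypothetical, and list positions stand in for row pointers.

```python
from collections import defaultdict

rows = [
    {"id": 1, "city": "Lisbon"},
    {"id": 2, "city": "Osaka"},
    {"id": 3, "city": "Lisbon"},
]

def build_index(rows, column):
    """Map each column value to the positions of the rows that contain it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

def insert_row(rows, index, column, row):
    rows.append(row)                                  # write the data
    index[row[column]].append(len(rows) - 1)          # and update the index

city_index = build_index(rows, "city")
lisbon_rows = [rows[p] for p in city_index["Lisbon"]]  # no full scan needed
insert_row(rows, city_index, "city", {"id": 4, "city": "Osaka"})
```

Every additional index adds another update like the one in insert_row to each write.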

Proxies: Control and Optimize Request Traffic

A proxy sits between clients and back-end services. Proxies are useful under high load, especially when cache capacity is limited or when requests can be combined.

A proxy can:

  • Filter requests.
  • Log requests.
  • Transform requests.
  • Add or remove headers.
  • Handle encryption or decryption.
  • Compress data.
  • Coordinate requests.
  • Cache frequently used resources.

One useful proxy behavior is collapsed forwarding. If several clients ask for the same data at nearly the same time, the proxy can collapse those requests into one back-end request, then share the result. This reduces repeated reads from the origin.

Client A --\
Client B ----> Proxy -> one origin request -> Back-end
Client C --/              |
                          v
                    shared response
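
Collapsed forwarding can be sketched with asyncio by letting concurrent callers await the same in-flight task. The class name, the key, and the simulated origin delay are assumptions for illustration; a real proxy would also handle errors, timeouts, and caching of the result.

```python
import asyncio

class CollapsingProxy:
    """Shares one origin request among callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._in_flight = {}  # key -> task for the pending origin request

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests refetch.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        return await task

async def demo():
    calls = []

    async def origin(key):
        calls.append(key)
        await asyncio.sleep(0.01)  # simulated slow back-end
        return f"value-for-{key}"

    proxy = CollapsingProxy(origin)
    results = await asyncio.gather(*(proxy.get("photo-1") for _ in range(3)))
    return results, len(calls)

results, origin_calls = asyncio.run(demo())
```

Three clients receive the same response while the origin is contacted once.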

Proxies are powerful, but they add another component that must be operated, scaled, and debugged.

Queues: Make Slow Work Asynchronous

In small systems, writes are often fast enough to do inline. In complex systems, high incoming load and slow individual writes can degrade performance. To keep user-facing requests fast, the system often needs asynchronous behavior.

A queue enables asynchronous communication between the part of the system that submits work and the part that performs it. The client sends a task, receives an acknowledgement from the queue, and continues its work. The acknowledgement can serve as a reference for checking results later.

Client
  |
  | submit task
  v
Queue
  |
  | workers pull tasks
  v
Worker Service
  |
  v
Database or External Service

Queues help with fault tolerance because they can protect the system from service outages, retry failed service requests, and avoid exposing clients directly to back-end failures. They can also enforce quality of service by limiting request size and the number of requests waiting in the queue.
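
The submit, acknowledge, and process flow can be sketched in-process with the standard library. A real deployment would use a separate queue service; here the task ID format, the size limit of 100, and the uppercase "work" are placeholders.

```python
import queue
import threading
import uuid

tasks = queue.Queue(maxsize=100)  # bounded: a simple quality-of-service limit
results = {}

def submit(payload):
    """Enqueue a task and return an acknowledgement ID immediately."""
    task_id = str(uuid.uuid4())
    tasks.put_nowait((task_id, payload))  # raises queue.Full when at capacity
    return task_id

def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: stop the worker
            tasks.task_done()
            break
        task_id, payload = item
        results[task_id] = payload.upper()  # stand-in for slow work
        tasks.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

ticket = submit("resize image")  # the client could continue other work here
tasks.join()                     # wait only so the example can show the result
tasks.put(None)
thread.join()
```

The client later uses ticket to look up the outcome in results, which mirrors checking a result by its acknowledgement reference.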

Common queue implementations include RabbitMQ, ZeroMQ, ActiveMQ, and beanstalkd.

Use a queue when immediate completion is not required. Do not use a queue to hide unclear consistency requirements. If users need a result immediately, the request path still needs a synchronous answer.

Consistent Hashing: Scale Partitions with Less Remapping

A basic distributed hash table might calculate an index using a hash function on the key. For a distributed cache with n servers, a simple strategy is hash(key) mod n.

That simple strategy has two major drawbacks:

  1. It is not horizontally scalable. Adding a new server changes n, which changes the mapping for most keys and may require downtime.
  2. It may not be load balanced if data is distributed unevenly. Some caches become hot and saturated while others remain idle.

Consistent hashing reduces remapping when scaling up or down. Instead of mapping keys directly with modulo arithmetic, place hash values on a logical ring. Servers are hashed onto positions on the ring. Keys are also hashed onto the ring. To find the server for a key, move clockwise until reaching the next server.

hash(key) -> position on ring
move clockwise -> first server found
store or read key from that server

When a new server is added, only the keys that fall between the new server and its predecessor on the ring move to it. When a server is removed, only the keys assigned to that server move, to the next server clockwise. With k total keys and n servers, only about k/n keys need remapping on average, instead of nearly all of them.

Real-world data may still be unevenly distributed. Virtual replicas help by mapping each physical server to multiple points on the ring. More replicas usually create a more even distribution and better load balancing.

Physical servers: A, B, C
Virtual positions: A1, A2, A3, B1, B2, B3, C1, C2, C3
Keys map clockwise to the nearest virtual position
Virtual position maps back to the physical server
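
A ring with virtual positions can be sketched with a sorted list and binary search. The class name, the "server#i" naming of virtual positions, the default of 100 replicas, and the use of SHA-256 are all assumptions for illustration.

```python
import bisect
import hashlib

def ring_hash(value):
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, servers=(), replicas=100):
        self.replicas = replicas
        self._points = []  # sorted virtual positions on the ring
        self._owner = {}   # virtual position -> physical server
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = ring_hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First position clockwise from the key, wrapping past the end.
        i = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[i]]

ring = HashRing(["A", "B", "C"])
keys = [f"key-{i}" for i in range(2000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("D")
moved = [k for k in keys if ring.lookup(k) != before[k]]
```

Every key in moved now belongs to D, and the rest stay where they were, in contrast to modulo hashing where most keys change servers.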

Consistent hashing is especially useful for distributed caches and partitioned data stores.

Client-Server Communication for Real-Time Features

Not every system needs real-time communication. When it does, the protocol choice affects overhead, latency, and server behavior.

Basic HTTP Request and Response

The basic HTTP pattern is simple: the client sends a request, the server prepares a response, and the server returns it. This works well when the client knows when it needs data.

Client -> request -> Server
Client <- response <- Server

AJAX Polling

Polling means clients repeatedly ask the server for new data at regular intervals. The problem is that many responses may be empty because no new data exists yet. This creates HTTP overhead and wastes capacity.

HTTP Long Polling

Long polling, sometimes called a hanging GET, reduces empty responses. The client sends a request and waits. The server delays the response until new data is available or until a timeout occurs. After receiving a response, the client sends a new long-poll request, either immediately or after a short pause.

Client -> long-poll request -> Server
Client <- full response when update exists or timeout occurs <- Server
Client -> next long-poll request -> Server

Each request still has a timeout, so clients must reconnect periodically.
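
The server side of one long-poll request amounts to a blocking wait with a deadline. The sketch below simulates that in-process with a standard library queue instead of real HTTP; the function name and the response shape are hypothetical.

```python
import queue

updates = queue.Queue()  # stand-in for the server's source of new data

def long_poll(timeout_seconds):
    """Hold the request until an update arrives or the timeout expires."""
    try:
        data = updates.get(timeout=timeout_seconds)
        return {"status": "update", "data": data}
    except queue.Empty:
        return {"status": "timeout", "data": None}

first = long_poll(0.01)       # nothing queued yet: ends with a timeout
updates.put("new message")
second = long_poll(0.01)      # returns as soon as data is available
```

A client would call this in a loop, reissuing the request after each response, whether it carried an update or a timeout.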

WebSockets

WebSockets provide a full-duplex communication channel over a single TCP connection. After a handshake, the client and server keep a persistent, bidirectional channel open. Either side can send data at any time.

This reduces overhead compared with repeated polling and supports real-time data transfer.

Client <==== always-open bidirectional channel ====> Server

Server-Sent Events

Server-Sent Events create a persistent, long-term connection where the server sends data to the client whenever new data is available. The channel is unidirectional from server to client. If the client needs to send data back to the server, it needs another request or protocol.

Server-Sent Events fit systems where the server continuously generates updates and the client mainly listens.
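
On the wire, a Server-Sent Events stream is plain text: each event is a group of "field: value" lines ended by a blank line. A small formatter sketch follows; the function name and the example event are hypothetical, and the server is assumed to send the result over a response with the text/event-stream content type.

```python
def format_sse(data, event=None, event_id=None):
    """Serialize one event in the text/event-stream format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")  # multi-line data uses one field per line
    return "\n".join(lines) + "\n\n"   # a blank line terminates the event

message = format_sse("price=101.5", event="tick", event_id="7")
```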

A Practical Workflow for Designing the System

Use this workflow when you need to turn requirements into an architecture.

  1. Scope the problem.

    • Define the main user actions.
    • Identify reads, writes, static media, background tasks, and real-time behavior.
  2. Start with the simplest request path.

    • Client to load balancer.
    • Load balancer to web or application server.
    • Application server to database.
  3. Add redundancy where failure would be unacceptable.

    • Multiple web servers.
    • Multiple application servers.
    • Database replicas or failover paths.
  4. Add caching for repeated reads.

    • Start with application cache if simple.
    • Use global cache if many app servers need shared cached data.
    • Use distributed cache when cache space and throughput must scale.
  5. Decide the database model.

    • Use relational storage for structured data and strong transaction integrity.
    • Use non-relational storage when data is large, flexible, distributed, or rapidly changing.
  6. Add queues for slow or unreliable work.

    • Return an acknowledgement quickly.
    • Let workers process tasks asynchronously.
    • Retry failed tasks where appropriate.
  7. Plan partitioning before one database becomes too large.

    • Horizontal partitioning for rows.
    • Vertical partitioning for features.
    • Directory-based partitioning when the partition scheme needs flexibility.
  8. Add indexes for known access patterns.

    • Index fields used for frequent lookup, filtering, or ordering.
    • Avoid indexing fields that do not support real queries.
  9. Choose communication protocols for live updates.

    • Polling for simple low-frequency checks.
    • Long polling when updates are occasional but should be delivered quickly.
    • WebSockets for bidirectional real-time communication.
    • Server-Sent Events for server-to-client event streams.
  10. Revisit tradeoffs.

    • What happens during a network partition?
    • What data can be stale?
    • What component can become a single point of failure?
    • What operation gets slower because of consistency, indexing, replication, or caching choices?

Testing the Design with Failure Scenarios

A design is easier to trust when you can walk through failure cases.

Scenario 1: One Web Server Fails

Expected behavior: the load balancer stops sending traffic to the failed server and continues routing to healthy servers.

Check these points:

  • Does the load balancer run health checks?
  • Are sessions or state stored outside a single web server?
  • Can a new server be added without special conditions?

Scenario 2: Cache Miss or Cache Node Failure

Expected behavior: the application fetches from the origin database or storage, then repopulates the cache if appropriate.

Check these points:

  • Is there a safe fallback path to origin data?
  • Does the invalidation strategy match the consistency requirement?
  • Could random load balancing reduce cache hit rate?

Scenario 3: Database Partition Becomes Hot

Expected behavior: the team can identify uneven data or request distribution and rebalance or change partitioning.

Check these points:

  • Is the partition key creating uneven ranges?
  • Does the design allow new database servers?
  • Is there a lookup service, and is it itself highly available?

Scenario 4: Background Worker Is Down

Expected behavior: requests can still enqueue tasks, and failed work can be retried when workers recover.

Check these points:

  • Does the queue acknowledge accepted tasks?
  • Are there limits on queue size and request size?
  • Are clients protected from temporary worker outages?

Scenario 5: Network Partition Between Database Nodes

Expected behavior: the system follows a deliberate consistency or availability decision.

Check these points:

  • Does the system stop serving stale partitions to protect consistency?
  • Does it serve possibly stale data to preserve availability?
  • Is the tradeoff acceptable for the business operation involved?

Common Mistakes

Adding Components Before Knowing the Bottleneck

Load balancers, queues, caches, and shards are useful, but each one adds operational complexity. Start from the bottleneck. Add the component that solves that bottleneck.

Treating Caching as a Free Speedup

A cache improves reads only when hit rates are good and invalidation is correct. Write-through adds write latency. Write-around can make new reads miss. Write-back can lose data if the cache fails before flushing.

Choosing a Database by Popularity

A relational database may be the better choice for structured data and ACID requirements. A non-relational database may be the better choice for flexible schemas or very large distributed data. The access pattern and correctness requirements should decide.

Ignoring Rebalancing

A sharding plan that works on day one can fail later because data or requests are not evenly distributed. Plan how to add servers and move data before a single shard becomes overloaded.

Overusing Polling for Real-Time Features

Frequent polling can create many empty responses and unnecessary HTTP overhead. Long polling, WebSockets, or Server-Sent Events may fit better depending on the communication direction and latency needs.

Forgetting Single Points of Failure

A lookup service, global cache, queue, or proxy can become a new single point of failure. Any critical shared component needs redundancy or a recovery plan.

System Design Checklist

Use this checklist before finalizing a back-end architecture:

  • The problem scope is clear.
  • Functional requirements and constraints are written down.
  • The main request paths are explained.
  • Load balancing exists where multiple servers serve the same role.
  • Critical services have redundancy.
  • Critical data has replication or backups.
  • The database choice matches structure, scale, and consistency needs.
  • CAP tradeoffs are clear for distributed data.
  • Caching is placed where repeated reads justify it.
  • Cache invalidation strategy is defined.
  • Static media can move to a CDN or separate static-serving layer.
  • Slow work is handled asynchronously with a queue where appropriate.
  • Partitioning strategy is defined before data becomes too large.
  • Rebalancing risks are understood.
  • Indexes support real lookup, filtering, or sorting patterns.
  • Proxies are used deliberately for request control, caching, or traffic optimization.
  • Real-time communication protocol matches the direction and frequency of updates.
  • Failure scenarios have been walked through.
  • Every major design choice has a tradeoff attached to it.

Conclusion

System design is the practice of making tradeoffs visible. Start with the simplest working request path, then add components only when they solve a clear scaling, reliability, or performance problem.

Load balancers spread traffic. Databases store state with different consistency and schema tradeoffs. Replication protects against failure. Caches reduce repeated work. CDNs move static media closer to users. Sharding spreads data across machines. Indexes speed reads but slow writes. Proxies optimize request traffic. Queues make slow work asynchronous. Consistent hashing reduces remapping when nodes change. Real-time protocols control how clients and servers exchange updates.

The best design is the one you can explain under pressure: what each component does, why it exists, what it costs, and how the system behaves when something fails.
