5 signs your backend needs re-architecting
Not every performance issue means you need a rewrite. Most of the time, you add an index, optimise a query, bump the instance size, and move on. But there are patterns that indicate something structural is wrong - problems where the fix is not better code in the current architecture, but a different architecture entirely. These are the five we see most often.
1. P99 latency is climbing and nobody changed anything
Your dashboards show p99 response times creeping up week over week. 80ms three months ago. 140ms last month. 210ms this week. No one shipped a slow query. No one introduced a new dependency. The code is the same. But the data is not.
What's happening underneath: The system was designed for a dataset size or traffic volume that has been exceeded. The most common culprit is a database query that performs a sequential scan because the table grew past the point where the query planner uses the index effectively. But it can also be connection pool exhaustion - your pool was sized for 50 concurrent queries and you now routinely need 200. Or it's serialisation overhead - the JSON payloads your API returns have grown from 2KB to 50KB as the data model expanded, and the serialisation cost that was invisible at 2KB now dominates the request.
Why patching won't fix it: You can add indexes and they'll help until the next table doubles. You can increase the connection pool but that shifts pressure to the database, which starts rejecting connections or slowing down. You can paginate responses, but if the underlying query still touches millions of rows to compute the first page, pagination is cosmetic. The fundamental issue is that the data access patterns have outgrown the data model and query architecture.
What the re-architecture looks like: Separate read and write paths (CQRS). Materialised views or pre-computed read models that serve queries without hitting the primary data store. A caching layer (Redis, Memcached) for frequently accessed aggregations. If the dataset is truly large, a purpose-built read store - Elasticsearch for search, ClickHouse for analytics, a denormalised Postgres schema optimised for the specific access pattern. The goal is to make query performance independent of total data volume.
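The cache-aside pattern at the heart of this is small enough to sketch. A plain dict stands in for Redis here so the example is self-contained; the function name, key format, and TTL are illustrative, not prescriptive:

```python
import json
import time

# Stand-in for Redis: key -> (expiry timestamp, serialised payload).
cache: dict[str, tuple[float, str]] = {}
CACHE_TTL_SECONDS = 60

def expensive_aggregation(account_id: str) -> dict:
    # Placeholder for the query that would scan the primary store.
    return {"account_id": account_id, "total_orders": 42}

def get_order_summary(account_id: str) -> dict:
    """Cache-aside: serve from the cache, fall back to the primary store."""
    key = f"order_summary:{account_id}"
    entry = cache.get(key)
    if entry is not None and entry[0] > time.time():
        return json.loads(entry[1])              # cache hit: no primary-store work
    result = expensive_aggregation(account_id)   # cache miss: compute once
    cache[key] = (time.time() + CACHE_TTL_SECONDS, json.dumps(result))
    return result
```

The TTL is the knob that trades freshness for load: every request within the window after the first is served without touching the primary store, which is what makes query cost independent of data volume.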
2. Deploying takes longer than developing
A one-line bug fix takes 15 minutes to write and 45 minutes to deploy. The CI pipeline runs the entire test suite (25 minutes). Then it builds a Docker image that includes every service in the monolith (8 minutes). Then it does a rolling deployment that takes 12 minutes because each instance needs 90 seconds to start up and pass health checks. A developer ships two changes per day because the pipeline is the bottleneck.
What's happening underneath: The codebase has become a deployment monolith. Even if the code is logically modular, the build and deployment process treats everything as a single unit. Every change triggers every test. Every build includes every dependency. Every deployment restarts everything. The startup time is long because the application initialises database connections, warms caches, loads configuration, and runs migrations on every boot - regardless of what actually changed.
Why patching won't fix it: You can parallelise tests, but if 80% of your tests are integration tests against a shared database, parallelism is limited by database connection availability. You can use multi-stage Docker builds to cache layers, but a dependency update invalidates the cache. You can skip tests on certain paths, but that introduces risk, and eventually one of the skipped tests is exactly the one that would have caught a real bug. The problem is that the deployment unit is too large for the rate of change.
What the re-architecture looks like: Decompose the deployment boundary. This does not necessarily mean microservices - it can mean a well-structured monorepo where each module has its own build pipeline, its own test suite, and its own deployment. The key is that a change to the payments module only builds, tests, and deploys the payments module. Nx, Turborepo, or Bazel can provide the dependency graph analysis to determine what needs rebuilding. Independent deployment means independent testing, independent builds, and independent rollbacks. A one-line fix deploys in 3 minutes, not 45.
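The "only rebuild what changed" analysis that Nx, Turborepo, and Bazel perform is, at its core, a reverse-dependency walk over the module graph. A hand-written sketch with hypothetical module names:

```python
from collections import deque

# Hypothetical module dependency graph: module -> modules it depends on.
DEPENDS_ON = {
    "api": ["payments", "auth"],
    "payments": ["shared"],
    "auth": ["shared"],
    "shared": [],
}

def affected_modules(changed: set[str]) -> set[str]:
    """Return the changed modules plus everything that transitively
    depends on them - the only things that need rebuilding."""
    # Invert the graph: module -> modules that depend on it.
    dependents: dict[str, list[str]] = {m: [] for m in DEPENDS_ON}
    for mod, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(mod)
    affected = set(changed)
    queue = deque(changed)
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With this graph, a change to `payments` rebuilds only `payments` and `api`; a change to `api` rebuilds `api` alone. That asymmetry is where the 45-minutes-to-3-minutes improvement comes from.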
3. One service going down takes everything down
The email notification service throws an unhandled exception. Within seconds, the API that calls it starts timing out. The frontend retries those requests. The API's thread pool fills up. The load balancer marks API instances as unhealthy because they're not responding to health checks. Users get 502 errors across every feature, even the ones that have nothing to do with email notifications.
What's happening underneath: The system has synchronous, tightly-coupled dependencies without isolation. Service A calls Service B synchronously in the request path. If B is slow, A is slow. If B is down, A blocks waiting for a response until a timeout fires - and if no explicit timeout is configured, many clients wait until the OS gives up on the connection, which can take minutes. Those blocked threads or connections accumulate until A's capacity is exhausted. A is now effectively down too, and anything that depends on A cascades further.
Why patching won't fix it: You can add timeouts (and you should - relying on unbounded or OS-default connection timeouts is almost always wrong). You can add retries with exponential backoff. You can add a circuit breaker. These are necessary and they'll reduce the blast radius. But if the fundamental interaction pattern is synchronous request-response across critical paths, you're building resilience on top of a fragile foundation. The circuit breaker prevents cascading failure, but it still means the feature that depends on the failed service is unavailable.
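Of these stopgaps, the circuit breaker is the least obvious, so here is a minimal sketch. The thresholds and the simplified half-open behaviour are illustrative; production libraries add half-open probe limits, metrics, and per-endpoint state:

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; rejects calls
    until `reset_timeout` seconds pass, then lets one call probe again."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Fail fast: don't tie up a thread waiting on a dead service.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result
```

The point of failing fast is exactly the cascade described above: a rejected call returns in microseconds instead of holding a thread for the full timeout, so the caller's capacity survives the downstream outage.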
What the re-architecture looks like: Move non-critical interactions to asynchronous messaging. The email notification doesn't need to happen in the HTTP request path. Publish an event to a message queue (RabbitMQ, SQS, Kafka) and return the response to the user immediately. The notification service consumes the event independently. If it's down, messages queue up and are processed when it recovers. The user-facing request path never depends on the notification service's availability. For interactions that must be synchronous, implement bulkhead isolation - separate thread pools or connection pools per downstream dependency, so one slow dependency cannot exhaust the capacity reserved for others.
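The queue-based decoupling can be sketched with the standard library. Here `queue.Queue` stands in for RabbitMQ/SQS/Kafka, and a background thread stands in for the independently deployed notification service; the handler and event shape are hypothetical:

```python
import queue
import threading

event_queue: queue.Queue = queue.Queue()
sent_emails: list[dict] = []  # stand-in for the notification service's work

def handle_signup(email: str) -> dict:
    """HTTP handler: publish the event and respond immediately.
    The response never waits on - or fails with - the email service."""
    event_queue.put({"type": "user_signed_up", "email": email})
    return {"status": "ok"}

def notification_worker() -> None:
    """Consumes events independently. If this process is down,
    events simply wait in the queue until it recovers."""
    while True:
        event = event_queue.get()
        if event is None:  # shutdown sentinel
            break
        sent_emails.append(event)  # placeholder for actually sending mail
        event_queue.task_done()

worker = threading.Thread(target=notification_worker, daemon=True)
worker.start()
```

Note what the structure buys you: `handle_signup` has no code path that can block on, or see an exception from, the consumer. With a durable broker in place of the in-memory queue, that isolation survives process crashes too.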
4. You can't add features without breaking existing ones
The product team wants to add a new pricing tier. The change touches the billing module, the permissions module, the API response schema, the frontend rendering logic, and the analytics pipeline. Three out of the last four feature releases caused regressions in unrelated functionality. Developers are afraid to change anything because they don't fully understand the consequences.
What's happening underneath: High coupling and low cohesion. Modules share database tables directly instead of communicating through defined interfaces. Business logic is scattered across layers - validation in the API handler, more validation in the service layer, business rules encoded in database triggers, and derived state computed in the frontend. A single concept like "user subscription" is represented in 6 different places with 6 slightly different interpretations. Changing one representation silently breaks the others.
The test suite offers false confidence. Tests pass because they test individual functions in isolation, but the bugs are in the interactions between modules. Integration tests exist but they test the happy path. The failure mode is always an edge case in the interaction between two modules that each work correctly on their own.
Why patching won't fix it: More tests help at the margin, but they treat the symptom. Code reviews catch obvious coupling but miss the subtle kind - two modules reading from the same database table are coupled even if they never import each other's code. Linting rules can enforce import boundaries but can't enforce data ownership boundaries.
What the re-architecture looks like: Define clear bounded contexts with explicit ownership of data and behaviour. Each domain (billing, permissions, analytics) owns its data and exposes it through a defined interface - an API, an event stream, or a well-defined module boundary. No shared mutable state between domains. The billing module publishes a "subscription changed" event. The permissions module consumes it and updates its own data store. They never read each other's tables. This is Domain-Driven Design applied at the architectural level, and it's the only reliable way to make large systems modifiable without fear.
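The billing/permissions interaction above can be sketched in-process. A list of subscribers stands in for Kafka, and dicts stand in for each context's private database; the event name and tier-to-feature mapping are invented for illustration:

```python
from typing import Callable

# Minimal in-process event bus (stand-in for Kafka/RabbitMQ).
subscribers: dict[str, list[Callable[[dict], None]]] = {}

def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    subscribers.setdefault(event_type, []).append(handler)

def publish(event_type: str, payload: dict) -> None:
    for handler in subscribers.get(event_type, []):
        handler(payload)

# --- Billing context: owns subscription data, publishes changes. ---
billing_db: dict[str, str] = {}

def change_subscription(user_id: str, tier: str) -> None:
    billing_db[user_id] = tier
    publish("subscription_changed", {"user_id": user_id, "tier": tier})

# --- Permissions context: owns its own store, never reads billing_db. ---
permissions_db: dict[str, set] = {}

def on_subscription_changed(event: dict) -> None:
    features = {"basic"} if event["tier"] == "free" else {"basic", "reports"}
    permissions_db[event["user_id"]] = features

subscribe("subscription_changed", on_subscription_changed)
```

The coupling now lives entirely in the event contract. Billing can restructure its tables freely; as long as it keeps publishing `subscription_changed` with the same shape, permissions never notices.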
5. Your database is the bottleneck for everything
Every slow API endpoint, when you trace it, ends at the database. The database CPU is at 80%. The connection count is at the limit. Slow query logs show queries across a dozen tables. You've added every index you can think of. You've vertically scaled the instance twice this year. You're running the second-largest instance your cloud provider offers and you're already planning the migration to the largest.
What's happening underneath: A single relational database is serving as the read store, write store, search engine, analytics engine, and job queue. User-facing queries compete for resources with background analytics aggregations. A reporting query that scans millions of rows locks pages that the OLTP queries need. The connection pool is shared between the API (which needs fast, short-lived queries) and the background workers (which run long, complex queries). Every workload degrades every other workload.
The schema has evolved organically. Tables have 60+ columns because new features added columns instead of new tables. Polymorphic associations (a type column that changes the meaning of other columns) make indexing ineffective because the query planner can't optimise for all the different shapes of data in the same table. JSON columns hold semi-structured data that is hard to index efficiently - expression or GIN indexes can help, but only for access patterns you anticipated.
Why patching won't fix it: Vertical scaling has a ceiling and the cost curve is exponential. Read replicas help for read-heavy workloads but don't reduce write contention on the primary. Query optimisation has diminishing returns when the fundamental issue is workload contention, not query inefficiency. Caching reduces read load but introduces cache invalidation complexity, and the writes are still hitting the single primary.
What the re-architecture looks like: Decompose the database by workload type. The OLTP workload (user-facing reads and writes) stays on a right-sized Postgres instance with a schema optimised for transactional access patterns. Analytics and reporting move to a columnar store (ClickHouse, BigQuery, Redshift) fed by CDC (Change Data Capture) streams from the primary. Full-text search moves to Elasticsearch or Typesense. The background job queue moves to Redis or a dedicated queue (SQS, RabbitMQ). Each store is optimised for its workload and scales independently.
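The CDC feed that keeps the analytics copy in sync is, at consumer level, an event applier. The event shape below is an assumption (in practice it would come from Debezium or a logical-replication slot, with its own envelope format), and a dict stands in for the columnar store:

```python
# Stand-in for the columnar analytics store, keyed by primary key.
analytics_store: dict[int, dict] = {}

def apply_change(event: dict) -> None:
    """Apply one change-data-capture event to the analytics copy.
    Assumed event shape: {"op": "insert" | "update" | "delete",
                          "pk": int, "row": dict | None}."""
    op, pk = event["op"], event["pk"]
    if op == "delete":
        analytics_store.pop(pk, None)
    else:
        # Insert or update: upsert the full row image.
        analytics_store[pk] = event["row"]
```

Because the analytics store is fed asynchronously from the change stream, its heavy scans never compete with OLTP queries for the primary's CPU, locks, or connections - the workload contention described above simply disappears.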
If the data volume is genuinely large (hundreds of millions of rows in hot tables), horizontal sharding of the primary database may be necessary. This is the most invasive change and should be the last option - but if your database is already on the largest instance available and still at capacity, it's the only path that leaves room for growth.
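At its core, sharding is a deterministic routing function from key to shard. A hash-mod sketch with hypothetical shard names (real systems typically use consistent hashing or a directory service so shards can be added without rehashing everything):

```python
import hashlib

# Hypothetical shard identifiers; in practice these map to connection strings.
SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2", "pg-shard-3"]

def shard_for(user_id: str) -> str:
    """Route a user to a shard by stable hash. Every read and write for
    this user must go through the same function, or reads miss writes."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The constraint this imposes is the real cost of sharding: any query that can't be routed by the shard key (cross-user reports, global uniqueness checks) now has to fan out to every shard or move to a different store.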
How to know if it's time
One of these signs on its own might be a local problem with a local fix. Two or more appearing simultaneously almost always means the architecture has reached its structural limit. The system isn't slow because of bad code - it's slow because the architecture was designed for a different scale, a different feature set, or a different workload pattern than what it's running today.
Re-architecting is expensive. Not re-architecting when you need to is more expensive. The compounding cost of slower development velocity, more frequent incidents, and increasing infrastructure spend will exceed the cost of the re-architecture within months - and the gap grows every week you wait.
The right time to re-architect is when you can still plan it. The wrong time is after the outage that forces it.
If you're seeing these patterns, Kaev can assess your architecture and build the path forward - without stopping feature development. Talk to us.