
Why Code Reviews and Tests Don't Improve What Matters

The Feedback We Think We're Getting

Software development has feedback mechanisms built into its standard practice. Code reviews before merging. Test suites that must pass. API contracts that must be satisfied. These are supposed to catch problems, enforce quality, and help programmers improve.

But here's the uncomfortable pattern: despite all this feedback, codebases still deteriorate. Systems still become harder to modify over time. Bugs still escape to production. New programmers still struggle to understand existing code.

The feedback mechanisms aren't broken in the sense that they fail to execute—code reviews happen, tests run, APIs enforce their contracts. They're broken in a more subtle way: they provide feedback on the wrong things, or more precisely, on things that don't correlate strongly with what we actually care about.

Code Review as Opinion Exchange

Let's start with code reviews, since they're positioned as a primary quality gate in most organizations. You write code, submit it for review, and other programmers examine it before it merges. The theory is that multiple pairs of eyes catch more problems than one.

What actually happens?

The most likely feedback you'll receive is other programmers' opinions on the same questions you already thought about: Should this be a separate function? Is this variable name clear enough? Should we use a library for this or write it ourselves? These are the questions you already wrestled with while writing the code. You made choices. Now other programmers tell you they would have made different choices.

This feedback is backed by nothing more than their individual experience and aesthetic preferences. Person A thinks functions should rarely exceed ten lines. Person B thinks that's arbitrary and prefers to keep cohesive logic together even if it's longer. Person C thinks the variable name userConfig is clearer than config. Person D thinks it's redundant given the context. Nobody is wrong, exactly. These are preferences.

The review conversation becomes a negotiation between preferences. Sometimes it's educational: you learn that the other person is thinking about future modification patterns you hadn't considered. More often it's just friction, energy expended to reach consensus on questions that don't have objectively correct answers.

Here's what makes this particularly frustrating: code reviews rarely catch the problems that actually matter. They don't tell you that your approach will perform poorly at scale because the reviewer doesn't know your data volumes. They don't tell you that your abstraction will break when requirements change next quarter because the reviewer doesn't know what's in the product roadmap. They don't tell you that your code introduces subtle timing dependencies because the reviewer is looking at a diff, not running the system under load.

What code reviews primarily enforce is local norms.

  • This is how we name our variables.
  • This is how we structure our tests.
  • This is the pattern we use for error handling.

These norms have value (consistency makes code more predictable), but they're not the same as quality. You can have perfectly norm-compliant code that's still hard to maintain, test, or modify.

The Questions You Already Asked Yourself

There's a particularly dispiriting aspect of code review feedback: it almost always addresses decisions you already considered.

You wrote a 50-line function. You know it's longer than ideal. You considered breaking it up. You decided the function is cohesive enough that splitting it would just create indirection without adding clarity. You made a choice.

The reviewer comments: "This function is too long. Consider breaking it into smaller functions."

Now you're in an awkward position. You can explain your reasoning, which takes time and might come across as defensive. You can acquiesce and split the function even though you don't think it improves anything. Or you can push back, which creates tension.

None of these outcomes involve learning something new. The reviewer hasn't identified a problem you missed; they've disagreed with a tradeoff you made. This is fine as collaboration: sometimes hearing another perspective helps you see things differently. But it's not feedback in the sense of information you lacked that improves your work. It's just another opinion on a judgment call.

This happens because code reviews operate on surface features. The reviewer can see that your function is 50 lines. They can see your variable names, your control flow, your comments. What they can't easily see is:

  • Why you structured it this way instead of alternatives
  • What constraints you were working under
  • What future changes this design accommodates or prevents
  • How this code relates to the broader system architecture

These deeper concerns are invisible in a diff. To evaluate them requires understanding context that's usually not in the code review tool. So reviewers fall back to evaluating what they can see: surface features. And since everyone has opinions about surface features, that's where the discussion happens.

The Bugs That Escape

Here's a revealing question: when was the last time a code review caught a real user-visible bug in your codebase?

Not a style violation. Not a potential future problem. Not something that could theoretically cause issues under circumstances that never occur in your system. An actual bug: code that would misbehave in production, impact users, and require a fix.

For most teams, the answer is rarely or never. Code reviews catch typos occasionally. They catch obvious logical errors if you're lucky. But the subtle bugs (the race conditions, the edge cases, the incorrect assumptions about data) slip through.

This isn't because reviewers are incompetent. It's because the kind of bugs that matter in production aren't visible in static code review. They emerge from the interaction between your code and the system it runs in: the load patterns, the data distribution, the timing of concurrent operations, the behavior of external services.
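As an illustration, here is a hypothetical sketch (not from any real codebase) of code that would sail through a static review yet misbehaves under concurrency:

```python
import threading

counter = 0

def increment(n: int) -> None:
    # In a diff this reads as obviously correct. But "counter += 1" is a
    # read-modify-write sequence: under contention, two threads can read
    # the same value and one update is silently lost. Nothing in the
    # static text reveals that; only running the code under load does.
    global counter
    for _ in range(n):
        counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Depending on interleaving, counter can end up anywhere from well below
# 200_000 up to exactly 200_000. The final value varies between runs.
print(counter)
```

The fix (a lock around the update) is one line, but the reviewer has no way to know it's needed from the diff alone.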

You can't see these things in a diff. You can barely see them in the full codebase. They only become apparent when the system is running, preferably under production-like conditions.

So what's happening is that we've created an expensive quality gate (code review takes time, creates blocking dependencies, requires coordination) that catches mostly stylistic issues while missing the actual problems. The feedback mechanism is working hard but producing little signal.

The Testing Paradox

Testing has a similar problem, though it manifests differently. We've convinced ourselves that testing is about quality: write tests, catch bugs, ship better software.

But then we develop this peculiar attitude: a test that reveals a problem is a success. A test that passes - that finds no problems - was a waste of time.

Think about what this implies. You spend an hour writing a test. The test passes. You haven't found a bug. Therefore, that hour produced no value? That's the implicit logic.

This is backwards. A passing test isn't a waste. It's confirmation that a potential failure mode doesn't occur, or documentation of expected behavior, or a constraint that prevents future regressions. But because we frame testing as bug finding, passing tests feel unproductive.
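To make that concrete, here is a minimal sketch (the helper and its rules are invented for illustration) of a passing test that still earns its keep as documentation and as a regression constraint:

```python
def normalize_email(raw: str) -> str:
    # Hypothetical helper: strip surrounding whitespace, lowercase the rest.
    return raw.strip().lower()

def test_strips_surrounding_whitespace():
    assert normalize_email("  a@example.com ") == "a@example.com"

def test_lowercases():
    assert normalize_email("A@EXAMPLE.COM") == "a@example.com"

# Both tests pass today. That time wasn't wasted: the tests pin down the
# intended behavior, so a future "harmless" refactor that, say, also starts
# stripping internal characters fails loudly here instead of in production.
test_strips_surrounding_whitespace()
test_lowercases()
```

A green run of these tests found no bug, yet it still delivered value: a written-down contract and a tripwire for regressions.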

The result is that programmers often don't write tests unless they're either required to (by team policy or TDD discipline) or trying to debug something specific. Testing becomes a chore rather than an engineering tool.

This is compounded by what tests actually check. Most tests verify the happy path: given valid inputs, does the system produce expected outputs? This is useful. But it's not where most real problems live.

Real problems live in:

  • What happens when the database is slow?
  • What happens when two requests arrive simultaneously?
  • What happens when the input is valid but unexpected?
  • What happens when a dependency starts returning errors?
  • What happens when memory is constrained?

These are hard to test. They require infrastructure to simulate production conditions. They're slow to run. They're flaky because they depend on timing. So mostly we don't write them. We test the paths we know work and hope the paths we're unsure about don't matter.
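They're effortful, but not impossible. One common middle ground is to substitute a failing stub for the dependency so the error path runs without any real infrastructure. A hypothetical sketch (all names invented for illustration):

```python
from unittest import mock

class PaymentGateway:
    """Stand-in for an external service (hypothetical)."""
    def charge(self, amount_cents: int) -> str:
        raise NotImplementedError

def checkout(gateway: PaymentGateway, amount_cents: int) -> str:
    # Translate a dependency failure into an explicit domain outcome
    # instead of letting the raw exception escape to the caller.
    try:
        return gateway.charge(amount_cents)
    except ConnectionError:
        return "retry_later"

# Simulate the "dependency starts returning errors" scenario: the stub
# fails every call, so the test exercises the path production will hit.
gateway = mock.Mock(spec=PaymentGateway)
gateway.charge.side_effect = ConnectionError("gateway down")
assert checkout(gateway, 500) == "retry_later"
```

This doesn't reproduce real timing or load, but it at least forces the failure path to exist and be exercised, rather than remaining untested hope.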

The feedback from testing, then, is heavily biased toward reassurance: things look fine even when they're not. You have a green test suite. Does this mean your code is correct? No; it means your code handles the scenarios you thought to test, under the conditions you simulated. The gaps in your testing are invisible.

APIs as Thin Glue

When you call an API - whether it's a library function, a web service, or a system call - you're working with remarkably little information. You have a function name, a parameter list, maybe a documentation comment. This is the thin glue holding your software together.

What you don't have:

  • How reliable is this API? Does it fail 0.01% of the time or 5% of the time?
  • What happens when it fails? Does it throw an exception, return an error code, hang indefinitely, or corrupt state?
  • What are its performance characteristics? Is it fast, slow, variable? Does it scale linearly with input size?
  • What are its dependencies? Does it do disk I/O, network calls, locking?
  • What are its guarantees? Is the operation atomic, idempotent, eventually consistent?

Some of this might be in documentation, if you're lucky. Most of it isn't. So you call the API based on hope and assumptions. You assume it's reasonably reliable. You assume failures are exceptional. You assume performance is acceptable. You assume it won't cause mysterious problems.

Then, months later, you discover that the API sometimes returns stale data. Or it holds a lock that creates contention. Or it makes a synchronous network call that blocks your thread. The thin glue turns out to be weight-bearing, and you didn't know it.

This is a feedback problem: the interface gives you no information about the properties that matter for reliability. It tells you the syntax (how to call it) but not the semantics (what guarantees it provides). You fly blind until something breaks.
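One defensive response is to refuse to trust the interface: impose your own deadline and collapse every failure mode into one explicit path. A sketch, assuming you can afford a worker thread per guarded call (`call_with_deadline` is an invented name, not a standard API):

```python
import concurrent.futures

def call_with_deadline(fn, *args, timeout_s=2.0, fallback=None):
    # Treat the API as opaque: we don't know its latency distribution or
    # failure modes, so we enforce our own deadline and map every failure
    # onto a single, visible fallback path.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return fallback   # the call hung; don't let it hang us too
        except Exception:
            return fallback   # unknown failure mode; degrade explicitly
    finally:
        pool.shutdown(wait=False)

# A well-behaved call passes straight through:
assert call_with_deadline(lambda: "ok") == "ok"
```

Note the limitation: Python can't forcibly kill the worker, so a truly hung call still occupies a thread. The wrapper only bounds how long your code waits, which is exactly the guarantee the API itself never gave you.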

Even worse, when something does break, the feedback is indirect. The API didn't fail. Your system failed because it used the API in a way that turned out to be problematic. Was that a misuse of the API or a flaw in the API? Often unclear. The API worked according to its specification (such as it was). Your assumptions about it were wrong.

Compounding Over Time

These broken feedback mechanisms compound. You get code reviews that enforce style but miss substance. You get tests that pass but don't prove correctness. You call APIs based on assumptions that turn out wrong. Each instance is survivable. The accumulated effect is that codebases become harder to work with over time, and nobody knows exactly why.

You add features, and they take longer than expected. You fix bugs, and new bugs appear elsewhere. You try to refactor, and discover dependencies you didn't know existed. You bring in new team members, and they struggle to understand the code despite it following all the standard practices.

The feedback mechanisms told you everything was fine. The code reviews passed. The tests were green. The APIs worked. But somehow the system is still hard to work with, and getting harder.

This is because the feedback mechanisms are measuring proxies for quality rather than quality itself. They measure whether you followed conventions, whether you handled test cases, whether you used APIs correctly. They don't measure whether your system is actually well-designed, maintainable, or reliable under production conditions.

The gap between the proxy and the real thing is where problems accumulate. You're optimizing for passing code review, not for code that's easy to modify. You're optimizing for green tests, not for systems that handle unexpected conditions gracefully. You're optimizing for API compliance, not for understanding your dependencies' behavior.

The Invisible Problems

The most pernicious aspect of broken feedback mechanisms is that they hide problems from view. When your feedback says everything is fine but your experience says this is getting harder, you start to doubt your experience. Maybe you're just not as sharp as you used to be. Maybe the problem domain is inherently complex. Maybe this is just what software development is like.

You don't consider that the feedback might be lying - not maliciously, but structurally. That it's giving you accurate information about surface properties while missing the deeper properties that determine whether your software is well-engineered.

Code review tells you your style is consistent but not whether your abstractions are appropriate. Tests tell you your logic is correct but not whether your architecture is sound. APIs tell you the syntax is right but not whether your assumptions are valid.

These are all failures of feedback: information that should guide you toward better decisions but instead reassures you that everything is fine while problems compound in the background.

The solution isn't to eliminate code reviews, testing, or API contracts. These mechanisms provide value, even if limited. The solution is to recognize their limitations and develop better feedback mechanisms, ones that actually measure the properties we care about: maintainability, modifiability, reliability under production load.

But that requires knowing what those properties are and how to measure them. Which brings us back to the fundamental problem: we don't have a systematic way to think about software engineering. We have practices that seem like they should help, and sometimes do, but we're never quite sure if they're working.

If you want to learn the thinking patterns that address these issues, visit stackshala. Our course tackles a lot of them not with another practice, framework, or pattern, but with first-principles thinking.
