The API Design Manual

The API Design Manual I wish someone else had written

Introduction

I’ve always had a deep interest in API design. Over the years, I’ve spent a fair bit of time studying how the best systems in the world expose their capabilities — cleanly, safely, and at scale. Recently, I decided to revisit the topic with fresh eyes. I dove back into the public-facing design principles of some of the industry’s most admired platforms: Amazon, Stripe, Notion, and a few others.

What I found was a sea of dense documentation and buzzword-heavy manifestos that made me pause. Terms like “operational independence,” “minimal surface area,” “strong contracts,” and “loose coupling” were everywhere. They sounded authoritative, but often felt hollow. After a while, I wasn’t sure if I was reading about APIs or abstract sculpture.

To give you a taste, here’s a sample of what I came across:

  • “driven by operational independence, long-term durability, and minimal coordination cost”
  • “everything is an API”
  • “loose coupling, strong contracts”
  • “backward compatibility is sacred”
  • “offer abstraction without leakage”
  • “prevent tight coupling”
  • “enforce isolation and fault boundaries”
  • “ensure version control and fallback behavior”
  • “stable interfaces, explicit schemas, and composable trust”

Now, I don’t doubt that behind these phrases are real, hard-won insights. But I do think they’ve been abstracted so far from reality that the practical lessons are getting lost. So I decided to write the kind of piece I wish I had found: a guide to API design that doesn’t talk down, doesn’t oversimplify, but also doesn’t hide behind buzzwords.

This is my attempt to peel back the layers, demystify the language, and lay out the core ideas behind good API design — clearly, honestly, and with examples that come from real-world constraints, not just architectural poetry.

If you’ve ever read a beautifully vague sentence like “expose capabilities, not internals” and asked yourself “but what does that actually mean?”, this one’s for you.

Principles

APIs are not just technical boundaries. They are architectural declarations of trust. An API says: “Here’s what you can rely on — here’s where our responsibilities end and yours begin.”

That simple boundary becomes a foundation for ecosystems, businesses, and tools that may evolve far beyond the original vision. So, how do you design APIs that aren’t just functional, but empowering?

APIs Aren’t Just Interfaces. They’re Invitations

Every API is an invitation to build. It sets the tone for how integrators think, the assumptions they make, and the confidence they have.

Consider the Shopify API. When it matured into a reliable, well-versioned platform, it unlocked a multi-billion dollar app ecosystem.

Thousands of developers built businesses on top of it, not because it exposed everything, but because it promised consistency, evolution, and respect for their investment.

Contrast this with an internal tool turned into a public API with no real investment in stability. Twitter’s early API strategy suffered precisely because it didn’t clarify its intentions. Developers flocked to build on it, only to be cut off when the business model changed.

That’s not an API; that’s a trap disguised as an opportunity.

API Design Is a Strategic Commitment

Treating an API as a surface-level feature is a mistake. It is a strategic, architectural commitment.

The decision to expose part of your system isn’t trivial; it creates a social contract with your users. Look at AWS. Every API they ship is an architectural pillar.

They don’t just build features; they define stable, long-lived interfaces that entire infrastructures can rely on. EC2, S3, Lambda: all are APIs first, products second.

When Amazon says “this is part of the public API,” it signals years of support, backward compatibility, and future-proofing.

That’s why CIOs bet on them.

That’s why developers trust them.

That’s why startups scale with them.

The Two Failure Modes

Expose Everything: The Pitfall of Accidental APIs

Exposing too much creates brittle dependencies.

Consider Jenkins, the popular CI/CD system. For years, plugins reached deep into its internals because the official API surface was too limited. The result? Plugins broke every time the core changed. This wasn’t a plugin ecosystem; it was source patching disguised as extensibility.

Every time an internal method gets called from the outside, it becomes part of your accidental API.

You can’t remove it without breaking someone’s build.

You can’t fix a bug without risking regressions downstream.

This isn’t empowerment; it’s entrapment.

Expose Nothing: The Path to Irrelevance

At the other extreme are platforms so tightly controlled that meaningful extension is impossible.

Apple’s early restrictions on HomeKit led to years of stagnation in smart home innovation. Developers couldn’t access meaningful parts of the stack, and workarounds were brittle or outright banned.

It wasn’t until Apple opened up key capabilities and published clearer APIs that the ecosystem began to grow in earnest.

Restrict too much, and nothing gets built.

The Sweet Spot: Empower Without Entangling

The best APIs don’t expose internals. They expose capabilities.

Kubernetes is a great example here. It offers stable APIs for pods, services, and operators without requiring access to internal scheduler logic.

Developers build entire platforms on Kubernetes, without having to patch its internals.

By focusing on capability exposure (that is, “what you can ask for” rather than “how it works inside”), Kubernetes allows integration without entanglement. That’s a powerful distinction.

Dogfooding

Nothing forces humility in API design like having to use your own interface.

Stripe famously builds its dashboard and internal tools using the same APIs it gives to customers. If a limitation exists, Stripe developers feel it first. If an edge case is painful, it hurts them too.

That feedback loop ensures the API is not just functional but developer-friendly.

Contrast that with internal-only tools wrapped in last-minute API layers for third-party use.

They often feel like an afterthought. If you don’t live within your own API’s constraints, why should anyone else trust them?

Composable Trust

Composable trust means creating boundaries that allow others to innovate independently without putting your system at risk. It’s what makes an API scalable, not just in traffic, but in collaboration.

Consider Notion. Their long-awaited public API wasn’t just about letting users pull data. It was about enabling automation, custom workflows, and new products.

They approached it slowly (some say too slowly), but with a clarity of purpose. When the API arrived, it was built on explicit contracts, extensible blocks, and predictable behavior.

The ecosystem is now blooming with integrations, bots, and tools.

Trust scales when it’s composable. And composable systems need APIs designed not just to serve, but to share control responsibly.

Start Shallow

And finally, the golden rule: start shallow. Add later. Every new feature, field, or endpoint you expose is a promise you now have to keep. If you’re not sure whether to expose it, wait. Learn from how people use the system. Let demand pull the surface area outward. It’s always easier to add something later than to take it away. Start with what you’re sure about. Let the rest evolve.

The Real-World Complexity Beneath the Principles

Lambda has a clean API, but hides setup complexity behind DevOps.

When people say “Lambda abstracts away infrastructure,” what’s often glossed over is that the abstraction requires a fair amount of prior configuration:

  • You must define execution roles.
  • Set up VPC access (if needed).
  • Deal with IAM permissions.
  • Configure timeouts, concurrency limits, and possibly DLQs.
  • And often, hook it into an event source (API Gateway, SQS, etc.).

That’s not removal of complexity; it’s shifting complexity from execution time to setup time, often into a DevOps/SRE responsibility zone.
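To make that concrete, here’s a rough sketch of the asymmetry using boto3, AWS’s Python SDK. The function name, role ARN, and zip file are placeholders; the point is how much has to exist before the one-line invocation works.

    import boto3

    lambda_client = boto3.client("lambda")

    # Setup: every argument below encodes a prior DevOps decision. The IAM
    # role must already exist, with the right policies attached.
    with open("handler.zip", "rb") as f:
        lambda_client.create_function(
            FunctionName="process-order",  # placeholder name
            Runtime="python3.12",
            Role="arn:aws:iam::123456789012:role/lambda-exec",  # placeholder ARN
            Handler="handler.main",
            Timeout=30,
            Code={"ZipFile": f.read()},
        )

    # Invocation: after all that setup, the call site is one clean line.
    lambda_client.invoke(FunctionName="process-order", Payload=b'{"order_id": 42}')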

This is a tradeoff by design:

The API stays clean and easy to reason about at the point of invocation, but the cost is higher upfront setup effort and organizational alignment.

Why Amazon still does this

Because the complexity is more manageable when it’s centralized:

  • Set up roles once.
  • Control policies in a centralized place (IAM).
  • Reduce risk at runtime by tightening the perimeter beforehand.

The payoff? Once set up, the operational surface is dramatically simplified for downstream teams.

So what does this mean for API design?

It means:

A clean API often depends on messy agreements elsewhere, usually around policy, identity, or provisioning infrastructure. That doesn’t make the API bad, but it does mean the design has a bounded focus. It’s honest about what it solves and what it delegates.

You can expose simple endpoints only if you require the hard work to be done before they are used.

Versioning sounds great, but how do you keep versions compatible in the real world?

Versioning is not magic. The question isn’t whether to version; it’s how to version well.

Real-world challenges include:

  • Surface area creep: more features = more interfaces = more chances to get stuck with a legacy burden.
  • Early instability: until real users try the API, you don’t know what assumptions will break.
  • Implementation leakage: accidental exposure of internals makes future changes painful.

Amazon’s approach (as inferred from practice):

  • Don’t expose features until they’re stable internally. This avoids premature exposure of brittle designs.
  • Use “preview” or “beta” markers to ship capabilities without committing to them. (e.g. X-Amz-Beta-Feature)
  • Design with layering: lower-level APIs are minimal and stable (e.g. S3’s PUT/GET), while advanced orchestration (e.g. CloudFormation or Step Functions) wraps those APIs with more evolving contracts.
  • Minimize output dependencies: avoid giving clients too much structure to rely on in responses unless you plan to freeze it forever.

Versioning takes multiple iterations, often without public exposure. Good versioning depends on resisting the urge to expose too early.
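As a rough illustration of the layering-plus-beta idea, here’s a Flask sketch. The endpoint paths and the beta header name are hypothetical: the v1 surface stays minimal and frozen, while an unproven capability hides behind an explicit opt-in marker.

    from flask import Flask, jsonify, request, abort

    app = Flask(__name__)

    # Stable, minimal v1 surface: small enough to keep backward compatible.
    @app.get("/v1/objects/<key>")
    def get_object(key):
        return jsonify({"key": key, "data": "..."})

    # A capability shipped behind an explicit beta marker, so clients opt in
    # knowingly and nothing about its shape is promised yet.
    @app.get("/v1/objects/<key>/versions")
    def list_versions(key):
        if request.headers.get("X-Beta-Feature") != "object-versions":
            abort(404)  # invisible unless explicitly requested
        return jsonify({"key": key, "versions": []})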

Idempotency: easy to talk about, but is it?

Let’s break it down.

Idempotency works for:

  • PUT-style operations (set this value to X)
  • Creating something only if it doesn’t already exist (with a unique token)
  • Enqueuing an event, where duplication is acceptable or can be deduplicated

But…

What about true side effects?

Let’s say you charge a credit card and it triggers downstream systems: not just a balance update, but also an inventory change, an email, a webhook, and an analytics pipeline.

You can’t blindly retry this.

How do real systems handle this?

They decompose operations into state + effect pipelines, e.g.:

  1. State change: Store the transaction intent (e.g., “charge pending”) in a database.
  2. Effects: Trigger side effects asynchronously via message queues.

So even if the request is retried, the actual side effects are deduplicated at the worker level by checking whether this event was already processed.
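Here’s a minimal sketch of that split, with in-memory structures standing in for a database and a message queue:

    import uuid

    transactions = {}          # durable state: intent, keyed by idempotency key
    processed_effects = set()  # dedup ledger consulted by effect workers

    def charge(idempotency_key, amount):
        # State change: idempotent by construction. A retry with the same key
        # returns the stored intent instead of creating a second charge.
        if idempotency_key in transactions:
            return transactions[idempotency_key]
        txn = {"id": str(uuid.uuid4()), "amount": amount, "status": "pending"}
        transactions[idempotency_key] = txn
        # In a real system, side effects would now be published to a queue.
        return txn

    def effect_worker(txn, effect_name):
        # Effects: deduplicated at the worker, so upstream retries are harmless.
        key = (txn["id"], effect_name)
        if key in processed_effects:
            return
        processed_effects.add(key)
        # ... send the email, emit the webhook, update the inventory ...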

Real-world design: Stripe

Stripe’s charge API isn’t just POST /charge. You send an Idempotency-Key header, and Stripe stores the intent.

Then, webhook workers emit events like charge.succeeded, which your system can consume exactly once, even if the upstream was retried.
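From the client’s side, the pattern looks roughly like this. Stripe does accept an Idempotency-Key header; the endpoint parameters and credentials below are illustrative.

    import uuid
    import requests

    idempotency_key = str(uuid.uuid4())  # generated once, reused on every retry

    for attempt in range(3):
        try:
            resp = requests.post(
                "https://api.stripe.com/v1/charges",
                auth=("sk_test_...", ""),  # placeholder secret key
                headers={"Idempotency-Key": idempotency_key},
                data={"amount": 2000, "currency": "usd", "source": "tok_visa"},
                timeout=10,
            )
            resp.raise_for_status()
            break  # at most one real charge, no matter how many attempts
        except requests.RequestException:
            continue  # safe to retry: same key, same outcome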

So the answer isn’t always “make the HTTP request idempotent.”

Sometimes the answer is: split the operation into a durable, idempotent state change and a resilient, eventually-consistent effect propagation.

Dogfooding, using your own API internally, is a great way to test whether it’s usable, intuitive, and complete. But it doesn’t mean internal teams get a free pass to do whatever they want.

If anything, it makes discipline more important. Without proper guardrails, internal teams often start relying on undocumented behaviors, passing debug flags, or using unstable headers.

Imagine you’re building an internal payments service and exposing an endpoint like:

POST /charge

Internally, the billing team may know that appending a debug flag like ?simulate_failure=true triggers fallback logic. If you don’t restrict that feature, other internal teams may start using it too.

These backdoors tend to stick, and over time, they become unintentional contracts. By the time you want to clean up or externalize the API, you realize you’ve painted yourself into a corner.

Internal use should follow the same contracts you want your external users to rely on. Someone has to be the gatekeeper, or those internal hacks become everyone’s problems.
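One way to be that gatekeeper is to make the backdoor structurally incapable of surviving into production. A sketch, using the hypothetical simulate_failure flag from above:

    import os
    from flask import Flask, request, jsonify, abort

    app = Flask(__name__)

    @app.post("/charge")
    def charge():
        if "simulate_failure" in request.args:
            # The debug flag only exists outside production, so it can never
            # harden into an accidental contract for other teams.
            if os.environ.get("ENV") == "production":
                abort(400, description="Unknown parameter: simulate_failure")
            return jsonify({"status": "failed", "simulated": True}), 402
        return jsonify({"status": "succeeded"})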

Trust boundaries are not binary. You don’t just trust or distrust a client.

You design access levels.

Maybe anonymous users can fetch public resources.

Maybe authenticated users can edit.

Maybe admins can delete.

Every capability has a cost if abused. The API defines the level of trust a caller must have before it can access each feature.

This could involve role checks, permissions, scopes, or even contextual constraints like IP or environment.

Smart systems scale trust by limiting what each actor can do, not by trusting everyone equally.
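Here’s a toy sketch of graded trust in plain Python: each operation declares the capability it requires, and the role-to-capability mapping lives in one place. All the names here are hypothetical.

    from functools import wraps

    ROLE_SCOPES = {
        "anonymous": {"read:public"},
        "user": {"read:public", "write:own"},
        "admin": {"read:public", "write:own", "delete:any"},
    }

    class Forbidden(Exception):
        pass

    def requires(scope):
        def decorator(fn):
            @wraps(fn)
            def wrapper(caller_role, *args, **kwargs):
                if scope not in ROLE_SCOPES.get(caller_role, set()):
                    raise Forbidden(f"{caller_role!r} lacks {scope!r}")
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @requires("delete:any")
    def delete_document(doc_id):
        print(f"deleted {doc_id}")

    delete_document("admin", "doc-123")   # allowed
    # delete_document("user", "doc-123")  # raises Forbidden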

Now consider a deeper problem: internal library churn.

Say your public API depends on a chain of internal services, and one of them uses a utility that gets silently updated. That update changes how a timestamp is formatted or how a validation rule is applied.

Suddenly your users see different behavior, but the API hasn’t changed. From their point of view, it’s a regression.

From yours, it’s an invisible shift. These failures are dangerous because they violate the spirit of stability without violating the letter.

The only way to protect against this is to treat your internal module boundaries like APIs too.

Test them. Version them. Track changes. Don’t assume internal means harmless.
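In practice, that can be as small as a contract test pinned to the boundary. Here, format_timestamp is a hypothetical internal utility; the test freezes the format your public API implicitly depends on, so a silent change breaks CI instead of customers.

    import re
    from datetime import datetime, timezone

    def format_timestamp(dt):
        # hypothetical internal utility shared across services
        return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    def test_timestamp_contract():
        out = format_timestamp(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
        assert out == "2024-01-02T03:04:05Z"
        assert re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", out)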

Summary of Reality-Adjusted Principles

  • A clean API doesn’t eliminate complexity; it concentrates it elsewhere. Know where you’re moving the mess.
  • Versioning requires discipline, iteration, and sometimes restraint — expose less, later.
  • Idempotency is necessary but not sufficient. When operations have external effects, design for eventual consistency and deduplication, not retries alone.
  • Dogfooding still requires gatekeeping.
  • Trust boundaries require access levels.
  • API regression can be layers deep.

Deciding What to Expose: Real-world Use Cases

Knowing what your system can do is not the same as knowing what it should let others do.

Just because a backend has a powerful internal operation doesn’t mean it’s safe, useful, or sustainable to expose it.

The decision to expose a capability should begin not with implementation, but with someone else’s goal.

What are your users trying to achieve, and does giving them this particular ability help them get there more reliably, efficiently, or confidently?

Take the example of S3’s CopyObject feature. It wasn’t part of the original API launch. At first glance, that seems like a gap: it’s an obviously useful capability, and in hindsight you’d wonder why it wasn’t included from day one.

But the truth is more nuanced. While it was always desirable to allow server-side copies of objects, it likely wasn’t easy to do safely or efficiently in the early architecture.

Without guarantees about consistency, durability, and cost management, exposing that feature too soon could have caused more harm than good.

Its delayed introduction reflects not neglect, but maturity: Amazon waited until it could offer that power without leaking complexity or reliability risks to users.

Another example comes from Notion, whose API choices reflect a clear alignment with its product’s philosophy.

Notion’s value lies in the way it structures content: its block model, page hierarchy, and structured databases. When they launched the public API, they didn’t rush to expose everything.

Instead, they focused on enabling automation and structured updates, while deliberately avoiding things like arbitrary styling, full-text search, or workspace-level administration.

Even though users wanted these, exposing them would have either compromised the structure that makes Notion consistent or opened the door to misuse. Their restraint didn’t come from limitation, but from clarity: they chose to enable extensions that matched the way the product was meant to be used.

Exposing a capability isn’t just a technical act; it’s a long-term commitment. If you expose something today that isn’t stable, or isn’t likely to survive architectural evolution, you’re tying your own hands.

Customers will build on it. Documentation will depend on it. SDKs will wrap it. And when it breaks or needs to change, your team will be stuck supporting it far longer than you’d like.

This is why teams like Stripe often ship internal features early but delay external exposure. They want to see how it behaves in production, how edge cases surface, and whether the abstraction really holds.

Once it does, they expose it in a way that’s clear, minimal, and forward-compatible.

But even a well-chosen capability can go wrong if it leaks internal design. A useful API should hide the mechanics of how the system works.

That doesn’t mean faking it; it means designing so that users can focus on what they want to do, not on how your backend was wired together.

For instance, Firebase Firestore allows you to set a document with a simple set() call. Under the hood, that may involve retries, local caching, conflict resolution, or sync queues.

But the user never sees that. They see: store this document and it just works. The interface is designed around intent, not mechanics.
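From the caller’s side, the whole interaction is intent-sized. This sketch assumes the google-cloud-firestore client library and configured credentials; the collection and fields are made up.

    from google.cloud import firestore

    db = firestore.Client()

    # One call expresses the goal: "store this document." Retries, caching,
    # and sync queues stay on Firestore's side of the boundary.
    db.collection("users").document("alice").set({"name": "Alice", "plan": "pro"})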

To avoid leaking your internal design, you have to be careful with the data you return and the errors you surface. Don’t return internal queue names, host IDs, or error messages from your dependency layers.

If a document fails to sync, tell the user something like: “Could not complete operation. Please try again later.”

Don’t tell them, “Node failure in zone-b3 shard-17.” Users want clarity, not architecture lessons.

Similarly, reserve backend-specific metadata for internal logging, not client-facing responses.

One of the hardest trade-offs comes when an error occurs and the user actually needs to take action.

For example, in a file upload system, a failed upload due to a transient network issue can be retried silently.

But what happens if the file passes upload but fails the virus scan step afterward? Now the system has received the file, but can’t proceed.

You can’t retry the scan silently if the failure persists, and you can’t re-ask the user to upload again without confusing them.

In such cases, the system must expose just enough information to prompt useful user action (“Security scan failed, please try uploading a different file”) while still hiding the gory backend details.

The trick is not to avoid exposing failures, but to present them in ways that reflect user-level context, not internal state.
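A small sketch of that translation layer; the scan result values and error codes are hypothetical:

    def upload_response(scan_result):
        # Map internal outcomes to user-level context; internal detail
        # belongs in logs, not in the response body.
        if scan_result == "clean":
            return {"status": "stored"}, 201
        if scan_result == "infected":
            return {
                "error": "security_scan_failed",
                "message": "Security scan failed. Please try a different file.",
            }, 422
        # Transient scanner outage: retryable, and the response says so.
        return {"error": "scan_unavailable", "message": "Please try again later."}, 503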

Exposing capabilities, then, is about more than surfacing power. It’s about designing interfaces where the user’s goal drives the shape of the API, not your implementation.

It’s about choosing not just what users can do, but how much of your world they need to see to do it. The best APIs let people build boldly while letting the system itself stay invisible and stable underneath.

Limitations of an API

Every API has limits. Some are technical, some are intentional, and some come from practical tradeoffs.

But here’s the important part: you can’t design a trustworthy API unless you’re clear about what it doesn’t do, where it might break, and what users should never rely on.

If your API is a road, then limitations are the road signs, speed limits, and warnings. They don’t make the road worse. They make it safer.

Sometimes the most important thing to say is “you can’t do that here.”

For example, imagine you have a document storage API, and users can archive old documents. That’s fine. But maybe once a document is archived, it becomes read-only and can’t be edited.

This needs to be absolutely clear in your documentation, in your error messages, and in the design of the API itself.

If someone tries to edit an archived document, the API should not only reject the request, but return a response that says exactly why.
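For instance, a rejection might look like this Flask sketch (paths, fields, and the error code are illustrative): a machine-readable reason plus a human explanation, so the limit reads as part of the contract rather than a mystery.

    from flask import Flask, jsonify

    app = Flask(__name__)
    documents = {"doc-123": {"status": "archived", "body": "..."}}  # toy store

    @app.patch("/documents/<doc_id>")
    def edit_document(doc_id):
        doc = documents.get(doc_id)
        if doc is None:
            return jsonify({"error": "not_found"}), 404
        if doc["status"] == "archived":
            return jsonify({
                "error": "document_archived",
                "message": "Archived documents are read-only. Unarchive it first.",
            }), 409
        return jsonify({"status": "updated"})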

No system is infinite. There are always constraints around how fast you can make requests, how much data you can send, or how often you can call an endpoint.

GitHub’s API, for example, lets users make 5,000 authenticated requests per hour. When users approach that limit, the API responds with headers that explain how much quota remains and when it will reset.

This isn’t just a rule; it’s a conversation. The system is saying: “You’re almost there, slow down.” If users don’t know these constraints exist, they’ll assume the API is unreliable when it starts rejecting calls.

So the key is not just enforcing limits; it’s communicating them clearly and consistently.
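Listening to that conversation takes only a few lines of client code. GitHub really does send these X-RateLimit-* headers; the token below is a placeholder.

    import requests

    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={"Authorization": "Bearer ghp_..."},  # placeholder token
        timeout=10,
    )
    remaining = int(resp.headers["X-RateLimit-Remaining"])
    reset_at = int(resp.headers["X-RateLimit-Reset"])  # epoch seconds

    if remaining < 100:
        print(f"Close to the limit; quota resets at epoch {reset_at}.")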

There’s also the matter of responsibility.

What does the API promise to do, and what does it expect the user to handle?

If you send a POST /send-email, does a 200 OK mean the email was delivered or just that it was accepted for processing?

That distinction matters. If delivery might fail later, users need a webhook or status check to track the outcome.

If the API guarantees eventual delivery, that’s a very different contract.

Being explicit about who is responsible for what is one of the most respectful things an API can do.
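One honest way to encode that contract is to say “accepted” explicitly and hand back a way to track the outcome. A Flask sketch, with hypothetical paths and statuses:

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.post("/send-email")
    def send_email():
        message_id = "msg_abc123"  # hypothetical id from the queueing step
        # 202 says exactly what happened: accepted, not delivered. The
        # Location header tells the caller where to learn the rest.
        return (
            jsonify({"id": message_id, "status": "queued"}),
            202,
            {"Location": f"/send-email/{message_id}/status"},
        )

    @app.get("/send-email/<message_id>/status")
    def email_status(message_id):
        # Hypothetical lookup; statuses might be queued / delivered / bounced.
        return jsonify({"id": message_id, "status": "delivered"})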

And finally, there’s the reality of promises. Numbers matter. If your API usually returns in 100ms but sometimes spikes to 2 seconds, your users need to know.

If you go down once a week for maintenance, document it.

These aren’t flaws. They’re realities.

But if they’re hidden, users will experience them as surprises. And surprises kill trust.

A good API doesn’t pretend to be perfect. It tells you what it can do, what it can’t do, what might go wrong, and how you’ll know.

That’s not weakness. That’s the foundation of reliability.
