
Designing payment flows that don't break the LLM call

Neul Labs
protocol, agents

The default HTTP payment flow assumes someone has time. A 402 redirects to a checkout page; a human reads a form; a credit card or bank account ferries authorization back; the resource finally arrives a few seconds later. For an MCP server being called by an LLM agent in the middle of a planning loop, “a few seconds” is the difference between “tool used” and “tool skipped.”

The mcp-pay v0.1 reference flow is six messages. That is the budget. If we can’t get from the first request to the final response within those six, the flow is too expensive to embed inside a normal tool call. So a lot of the design effort went into making each of those six messages cheap.

The six messages, in order

1.  Agent → Server      GET /api/forecast
2.  Server → Agent      402 Payment Required
                        X-PAYMENT-REQUIRED: {…}
3.  Agent → Server      GET /api/forecast
                        X-PAYMENT: {…}
4.  Server → Facilitator POST /verify
5.  Facilitator → Server { valid: true }
6.  Server → Agent      200 OK
                        X-PAYMENT-RESPONSE: {…}
                        { forecast: … }

Several things to notice.

The 402 carries everything the agent needs in X-PAYMENT-REQUIRED. There is no separate “negotiation” step where the agent asks for a quote and the server responds — the quote is the 402. That collapses two messages into one and removes a round trip from the critical path. If the agent already fetched /.well-known/mcp/pay.json during MCP discovery, the 402 is mostly a confirmation: the price the server is quoting now should match the price in the manifest. If it doesn’t, the agent has grounds to refuse the call without paying.
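As a minimal sketch of that confirmation step, an agent could compare the quote in the 402 with the manifest entry it fetched earlier. The field names (`rail`, `asset`, `amount`) are hypothetical placeholders, not the mcp-pay schema, and the values are illustrative.

```python
from decimal import Decimal


def quote_matches_manifest(quote: dict, manifest_entry: dict) -> bool:
    """Return True when the 402 quote agrees with the manifest entry.

    The keys "rail", "asset" and "amount" are assumed for illustration only.
    """
    same_terms = all(quote.get(k) == manifest_entry.get(k) for k in ("rail", "asset"))
    # Compare amounts as decimals so "0.003" and "0.0030" count as equal.
    same_price = Decimal(quote["amount"]) == Decimal(manifest_entry["amount"])
    return same_terms and same_price


manifest_entry = {"rail": "x402", "asset": "USDC", "amount": "0.003"}
quote_ok = {"rail": "x402", "asset": "USDC", "amount": "0.0030"}
quote_bad = {"rail": "x402", "asset": "USDC", "amount": "0.03"}
```

With these inputs, `quote_ok` passes the check and `quote_bad` gives the agent grounds to refuse before paying.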

The agent’s payment proof rides as X-PAYMENT on a normal retry of the same request. There is no redirect. No new URL. The server gets the same handler invoked twice, the second time with a header that proves payment. From the server framework’s perspective — axum in our reference — this is the same route. The 402 path and the 200 path share the same business logic.
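From the agent’s side, the retry can be sketched as a small function. This is an illustration under stated assumptions: `send` and `sign` are hypothetical callables standing in for an HTTP client and a wallet, header values are assumed to be JSON strings, and `fake_server` exists only to exercise the loop.

```python
import json
from typing import Callable

Response = tuple[int, dict, str]  # (status, headers, body)


def call_paid_tool(send: Callable[[dict], Response],
                   sign: Callable[[dict], dict]) -> Response:
    """Send the request; if it returns 402, retry once with a payment proof."""
    status, headers, body = send({})                    # message 1
    if status != 402:
        return status, headers, body
    quote = json.loads(headers["X-PAYMENT-REQUIRED"])   # message 2
    proof = sign(quote)                                 # built locally by the agent
    return send({"X-PAYMENT": json.dumps(proof)})       # message 3, answered by message 6


def fake_server(request_headers: dict) -> Response:
    """Stand-in for one route that is hit twice: once unpaid, once paid."""
    if "X-PAYMENT" not in request_headers:
        return 402, {"X-PAYMENT-REQUIRED": json.dumps({"amount": "0.003"})}, ""
    return 200, {"X-PAYMENT-RESPONSE": json.dumps({"settled": True})}, '{"forecast": "sunny"}'
```

The same `send` callable is used for both attempts, which mirrors the point above: same URL, same route, one extra header.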

Verification happens server-to-facilitator, not agent-to-facilitator. The agent never talks to the facilitator. That matters for two reasons. First, the agent does not need to discover the facilitator URL; the server already knows it. Second, the verification exchange is hidden behind the server’s second response: the agent makes no extra call of its own, so if the facilitator is slow, the agent only sees one slow tool call, not three.

Finally, the response carries X-PAYMENT-RESPONSE alongside the actual result. The agent gets the data it asked for and the settlement metadata it needs in the same response: the result in the body, the metadata in the header. There’s no follow-up call to learn what happened. The flow ends with one message.

Why we resisted adding a “session” concept

A reasonable instinct is to make the payment proof reusable. An agent that pays once should not need to pay again for the same minute, or the same plan, or the same session. We thought hard about it and pushed it out of v0.1.

The reason is composition. The moment a payment is “valid for the next thirty seconds,” every component that touches the request has to know about that window. The server has to track sessions. The agent has to remember which proofs are still good. The facilitator has to expose a query API. Caches in the middle have to respect a new TTL semantic. The flow goes from “stateless 402 + header” to “stateful agreement with side state.” That is the kind of complexity that kills adoption.

A rail-side feature can do better. x402 facilitators can already issue receipts that are reusable for a window, if both sides agree. Lightning has its own credential semantics. mcp-pay leaves that decision to the rail and stays out of it. The trade-off is that an agent calling the same paid tool five times pays five times. But each payment is $0.003, so the total is $0.015, and the round trips are short. The savings are not worth the complexity.

Latency budget in practice

A normal MCP tool call has a soft budget of a few hundred milliseconds. The agent will call several tools in parallel, and its planner stalls until enough of them return for the next step. Adding payment must not blow that budget out by more than a small multiple, or paid tools get deprioritized.

Three things make the budget achievable.

The 402 response is cheap. The server does no work on the request itself, so the 402 is mostly headers. The cost is one network round trip plus a small amount of server CPU.

Payment proof construction is local. For x402, the agent already has a USDC balance on Base; constructing a proof is a signature, not a network call. For Lightning, it is an invoice fetch, which is one round trip to the agent’s own Lightning node. Neither involves the server.

Verification is invisible to the agent. It is a round trip the agent never makes. The server holds the connection open, calls the facilitator, and only responds once verification returns. From the agent’s point of view, the entire flow is “request, 402, request, response”: two requests, two responses. Verification time shows up only as extra latency on the second response, not as an additional exchange the agent has to manage.
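A rough additive model makes the accounting concrete. The agent waits for two exchanges, and because the server holds the connection open during verification, the facilitator’s time is folded into the second one. The function name, its parameters, and the input values below are illustrative assumptions, not measurements from the reference server.

```python
def agent_visible_latency_ms(agent_server_rtt: float,
                             proof_construction: float,
                             facilitator_time: float,
                             handler_time: float) -> float:
    """Wall-clock time the agent waits, under a deliberately simple additive model."""
    # Messages 1-2: request out, 402 back.
    first_exchange = agent_server_rtt
    # Messages 3-6: retry, verification inside the server, then the real response.
    second_exchange = agent_server_rtt + facilitator_time + handler_time
    return first_exchange + proof_construction + second_exchange


# Placeholder inputs chosen only to exercise the function; they are not measurements.
total = agent_visible_latency_ms(
    agent_server_rtt=40, proof_construction=5, facilitator_time=120, handler_time=60
)
```

In this model the only term the server operator can change by picking a different facilitator is `facilitator_time`.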

In the reference server, the entire flow on Base USDC sits at roughly two seconds end-to-end, dominated by on-chain settlement, with the facilitator round trip a small fraction of that. Lightning is similar in practice. The flow is not free, but it is within the budget of any tool that does real work, and tools that don’t do real work probably shouldn’t be paid in the first place.

The 402 carries enough to be self-describing

A lot of design hours went into the X-PAYMENT-REQUIRED payload. The earliest sketches tried to keep it small by referring back to the manifest — “see /.well-known/mcp/pay.json for the rules.” We removed that. The 402 has to be self-describing.

The reason is the long tail of intermediaries. Proxies, gateways, and edge functions sit between the agent and the server. A 402 that requires a separate fetch to another path is a 402 that fails in any environment where the second fetch is not authorized, not cached, or rate-limited. So the 402 carries the same rail information the manifest does, in compact form, for the call being attempted.

The redundancy is a feature. The manifest is the long-lived static description of what the server charges. The 402 is the short-lived dynamic confirmation for one specific call. An agent that already has the manifest can verify the two match. An agent that does not can still proceed.
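One way to picture the relationship is a server that derives the 402 payload from its manifest, copying only the rail fields needed for the call at hand. The manifest layout and every key name here (`resources`, `rails`, `pay_to`, and so on) are hypothetical, as is the placeholder address; the real schema is defined by the spec.

```python
import json

RAIL_FIELDS = ("rail", "asset", "amount", "pay_to")  # assumed minimal set


def build_payment_required(manifest: dict, path: str) -> str:
    """Build a compact, self-describing X-PAYMENT-REQUIRED value for one resource."""
    entry = manifest["resources"][path]
    payload = {
        "resource": path,
        # Copy only what the agent needs to pay; drop manifest-only extras.
        "rails": [{key: rail[key] for key in RAIL_FIELDS} for rail in entry["rails"]],
    }
    return json.dumps(payload, separators=(",", ":"))


manifest = {
    "resources": {
        "/api/forecast": {
            "description": "Hourly forecast",
            "rails": [{"rail": "x402", "asset": "USDC", "amount": "0.003",
                       "pay_to": "0xEXAMPLE", "notes": "manifest-only"}],
        }
    }
}
header_value = build_payment_required(manifest, "/api/forecast")
```

An agent holding the manifest can diff the two; an agent without it can pay from `header_value` alone.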

What we learned writing the reference server

Three things were unexpected during implementation.

Verification timing matters more than rail. Whether the server talks to a Coinbase facilitator over HTTPS or to a local Lightning node over a Unix socket, the only thing that affects end-to-end perceived latency is the time the facilitator takes to confirm. Picking facilitators with low p99 confirmation latency is the highest-leverage thing a paid MCP server can do.

axum’s middleware model fits the flow perfectly. A payment-required middleware can short-circuit a request that lacks X-PAYMENT, dispatch verification when one is present, and pass through to the route handler when verification succeeds. The route doesn’t need to know it’s gated. That keeps tools cleanly testable.
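The shape of that middleware is framework-independent. The sketch below is not the axum code from the reference server; it is a plain-Python wrapper with hypothetical names (`require_payment`, `verify`) showing the three branches: short-circuit, verify, pass through.

```python
import json
from typing import Callable

Response = tuple[int, dict, str]  # (status, headers, body)
Handler = Callable[[dict], Response]


def require_payment(quote: dict, verify: Callable[[str], bool]) -> Callable[[Handler], Handler]:
    """Wrap a handler so it only runs when X-PAYMENT verifies."""
    def wrap(handler: Handler) -> Handler:
        def gated(headers: dict) -> Response:
            proof = headers.get("X-PAYMENT")
            # Missing or unverifiable proof: answer with the quote, skip the handler.
            if proof is None or not verify(proof):
                return 402, {"X-PAYMENT-REQUIRED": json.dumps(quote)}, ""
            status, out_headers, body = handler(headers)
            out_headers = {**out_headers, "X-PAYMENT-RESPONSE": json.dumps({"verified": True})}
            return status, out_headers, body
        return gated
    return wrap


def forecast(headers: dict) -> Response:
    """The route itself knows nothing about payment."""
    return 200, {}, '{"forecast": "sunny"}'


# verify is a stub standing in for the server-to-facilitator call.
gated_forecast = require_payment({"amount": "0.003"}, verify=lambda p: p == "valid-proof")(forecast)
```

Because `forecast` is an ordinary function, it can be tested directly without any payment setup, and the gate can be tested with a stubbed `verify`.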

The hardest part is the schema, not the server. Writing a server that returns 402 is easy. Writing a manifest schema that compresses six rails into one consistent shape, without forcing every server to fill in fields that don’t apply to its rail, is where most of the spec work went. v0.1 leans heavily on optional fields and model discriminators to make that work.

What a “broken” call looks like

The other side of “flows that don’t break the LLM call” is graceful failure. The mcp-pay spec is explicit: if the facilitator can’t verify, the server returns a 402 with a reason. If the rail itself is down, the server returns a 503 that lists the rails still up. The agent can route around the failure. The plan does not stall.
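The two failure cases can be sketched as a single mapping on the server side. The function name and the response keys (`error`, `reason`, `available_rails`) are assumptions for illustration; the spec defines the actual shape.

```python
def payment_failure(rail: str, rails_up: dict, verify_error: str) -> tuple[int, dict]:
    """Map a failed payment attempt to the status and body the agent should see."""
    if not rails_up.get(rail, False):
        # The rail itself is down: tell the agent which rails it can still use.
        available = [name for name, up in rails_up.items() if up and name != rail]
        return 503, {"error": "rail_unavailable", "rail": rail, "available_rails": available}
    # The rail is up but the proof did not verify: ask for payment again, with a reason.
    return 402, {"error": "verification_failed", "reason": verify_error}


rails_up = {"x402": True, "lightning": False}
```

An agent receiving the 503 can retry immediately on one of `available_rails` instead of waiting or abandoning the tool.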

That is the whole goal. Charging an agent is fine. Stranding it is not.