A Paperclip upgrade broke every agent wake in our OpenClaw setup.
At first glance, it looked ambiguous.
Was OpenClaw sending the wrong thing to Paperclip? Was Paperclip sending the wrong thing to OpenClaw? Was this a bad deploy, a stale UI version, an undocumented upgrade requirement, or a genuine upstream regression?
It turned out to be the last one.
The fix was much smaller than the investigation.
The symptom
After upgrading Paperclip, agent runs started failing at the OpenClaw gateway boundary.
The key error was:
invalid agent params: at root: unexpected property 'paperclip'
That immediately narrowed the problem down.
OpenClaw was rejecting a request shape. So the question became: who was sending that shape?
The answer was Paperclip’s openclaw_gateway adapter.
It was building outbound agent params that included a root-level paperclip object, and our current OpenClaw gateway schema rejected unknown root fields.
The first important lesson: don’t trust surface signals too quickly
This incident had a few easy traps.
1) The Paperclip UI version label was misleading
The UI still showed 0.3.1, even after a newer image was live.
That could have sent us chasing the wrong problem, like assuming the deploy failed.
It hadn’t.
The safer verification path was:
- check the live container image digest,
- inspect OCI revision labels,
- inspect the running container filesystem,
- and hit the health endpoint directly.
If you run self-hosted systems, this matters.
UI metadata is not deployment truth. Container/image metadata is.
2) Release notes were not enough
We checked release notes and upgrade-guide material.
There was no useful warning that said:
- OpenClaw compatibility changed,
- OpenClaw must be upgraded first,
- or a root-level
paperclippayload field had been introduced.
So this was not a case of “we skipped the obvious documented step.”
3) latest was not automatically safe
We redeployed current upstream ghcr.io/paperclipai/paperclip:latest to test whether master had already fixed the issue.
It had not.
The bad behavior was still present in source and in runtime.
What was actually happening
The direction of the bug was:
Paperclip -> OpenClaw
Paperclip was sending agent params shaped like this:
{
"message": "...",
"sessionKey": "paperclip",
"idempotencyKey": "...",
"paperclip": {
"runId": "...",
"companyId": "...",
"agentId": "...",
"agentName": "..."
}
}
That root-level paperclip field was the problem.
OpenClaw’s agent param validation rejected it.
And because validation failed at the gateway boundary, agent wakes failed before useful work even started.
The key debugging move: verify the exact request contract
The most useful step was not reading release notes.
It was verifying both sides of the contract directly:
- inspect the outbound payload Paperclip was actually sending,
- inspect the local OpenClaw validation behavior,
- confirm whether the mismatch was real and reproducible.
That turned the problem from a vague integration failure into a concrete schema mismatch.
Once we had that, the repair options became much cleaner.
We found the likely regression commit
Upstream history pointed to a clear story:
- an earlier fix had removed the bad root-level
paperclipproperty, - later, commit
91e040a69687f667874869d77766b407353196c8reintroduced it, - newer PRs were effectively trying to restore the previous behavior.
That mattered because it changed the strategy.
This did not look like a broad architectural incompatibility.
It looked like a small regression with a small fix.
Why we did not patch OpenClaw
We had three practical choices:
- patch OpenClaw to accept the extra field,
- roll Paperclip back to an older pre-regression version,
- patch current Paperclip to stop sending the bad field.
We chose option 3.
Why:
- the bug originated on the Paperclip side,
- the OpenClaw-side patch would widen validation for a field we did not actually want,
- rolling back would drop unrelated upstream improvements,
- and the Paperclip-side fix was tiny and easy to remove later.
When an integration breaks, the best fix is usually the one with the smallest blast radius at the layer that introduced the incompatibility.
That was Paperclip.
Why we did not do a long-term fork
We also chose not to create and maintain a permanent fork.
Instead, we:
- cloned current Paperclip master locally,
- applied a tiny patch,
- built and deployed that to the existing CapRover app,
- and kept the plan to move back to official upstream images once the upstream PR lands.
That let us restore service quickly without signing up for long-term fork maintenance.
This is a useful pattern:
temporary forward patch > broad rollback > permanent fork
when the upstream fix is already conceptually known and the diff is tiny.
The actual workaround
The workaround was almost comically small.
In packages/adapters/openclaw-gateway/src/server/execute.ts, we removed the bad root-level field from outbound agent params.
We ended up with:
delete agentParams.text;
delete agentParams.paperclip;
instead of sending:
agentParams.paperclip = paperclipPayload;
After deploying that patched build, Paperclip -> OpenClaw wakes started working again.
That was the production recovery.
The second important lesson: a good workaround is easy to undo
The reason this workaround was attractive is that it had a very low cleanup cost.
It did not require:
- a database rewrite,
- a config migration,
- a protocol redesign,
- or a shadow compatibility layer.
It was a small compatibility patch on top of current upstream code.
That means the exit path is clean:
- wait for upstream to merge the fix,
- deploy official
ghcr.io/paperclipai/paperclip:latest, - verify health and one real wake,
- delete the temporary patch from our process.
Good incident work is not just about restoring service.
It is also about minimizing future cleanup.
One more gotcha: image rollback stopped being purely image-only
During the patched deploy, Paperclip applied several pending migrations.
That did not block the recovery, but it changed the rollback posture.
Before that point, an image rollback was simpler. After that point, any rollback needed a little more care because runtime state had moved forward too.
This is another useful operational rule:
capture rollback artifacts before the next deploy, not after.
We had already snapshotted service/container state and health before redeploying, which made this much less stressful.
What I would recommend to anyone running self-hosted agent stacks
If you hit an upgrade break like this, here is the checklist I would use:
1) Verify the live artifact, not the UI label
Check:
- image digest,
- OCI revision,
- live container filesystem,
- health endpoint.
2) Inspect the exact request/response contract
Do not stop at log summaries. Capture the actual payload shape on the failing boundary.
3) Figure out which side owns the mismatch
Patch the side that introduced the incompatibility unless there is a strong reason not to.
4) Prefer a tiny forward patch over a messy rollback
Especially when:
- the offending diff is known,
- upstream is already discussing the fix,
- and your patch is easy to remove later.
5) Treat release notes as hints, not proof
Documentation can lag reality. Runtime validation is the source of truth.
Final takeaway
This incident looked bigger than it was.
The investigation touched release notes, container metadata, upstream issues, source history, health checks, CapRover deploys, and request-schema validation.
But the fix itself was simple:
- identify the exact incompatible field,
- patch the producer,
- stay close to upstream,
- and keep the escape hatch back to official images.
That is usually the right posture for self-hosted AI infrastructure.
Not panic. Not a giant fork. Not speculative rewrites.
Just tight diagnosis, small diff, fast recovery, clean exit.