Claude Files API – blogs.jutsu.ai

Here’s what changed when I switched from inline text blobs to the Claude Files API—and why I kept it in production:

Files API replaced my 40KB inline blobs with reusable file IDs across requests
Citation grounding cut hallucinated quotes to near zero in 200 test runs
Cache reuse on a 90KB contract saved 11 seconds per follow-up question
Cleanup cron deletes orphaned files after 7 days so storage stays flat

I moved my document pipeline from inline text blobs to the Claude Files API, and follow-up latency fell from 14 seconds to 3. The bigger win was citation grounding: instead of paraphrasing clauses and getting them slightly wrong, Claude now quotes the exact line with a reference. Below are the five patterns I run in production, with the numbers that earned each one a permanent spot. Start with one, measure, then layer in the rest. Make progress visible.

Pattern 1: Upload Once, Reference By ID

Before the Files API, I put document text directly into the messages array. A 40KB PDF became 40KB of inline content on every request. Five follow-up questions meant sending that 40KB five times. It was wasteful and bloated prompts in ways that made debugging painful.

The Files API fixes this. Upload once, get a file ID, and reference that ID in the content block. The upload is a multipart POST to the files endpoint, which returns an ID like file_abc123 that lives on Anthropic’s side.

file = client.beta.files.upload(
    file=("contract.pdf", open("contract.pdf", "rb"), "application/pdf")
)
# later, in a message
content = [{"type": "document", "source": {"type": "file", "file_id": file.id}}]

In practice, a flow that used to send 38KB per request now sends a ~30-character ID. Over a session with eight questions on one document, that’s 304KB of redundant payload I no longer push across the wire.

Important scoping detail: file IDs are scoped to your organization, not a single conversation. You can upload in one request and reference it an hour later in another. Track which IDs belong to which user or you’ll leak document access across sessions. I store a tiny SQLite mapping: file ID, user ID, upload timestamp, TTL. That table is the spine for everything else here.

If you’re wiring this into a larger agent, IDs play nicely with tool loops—you can pass file IDs between tools without re-serializing the document. For broader request scaffolding, I lean on the Claude Blueprint, which shows how it fits together.

Pattern 2: Citation Grounding That Actually Cites

The feature that justified the whole migration was citations. Attach a document and enable citations on the document block; Claude returns structured references pointing to the exact span of text it used. No more “the contract says you can cancel anytime” when it actually says you can cancel within 30 days with written notice.

content = [{
    "type": "document",
    "source": {"type": "file", "file_id": file.id},
    "citations": {"enabled": True}
}]

The response includes citation objects with cited text and location. I render these as footnotes that link back to the source span. Trust improved immediately—people can click and check.

I tested 200 questions against 12 contracts before and after enabling citations. Without grounding, 34 answers contained a quote that didn’t appear verbatim in the source. With citations enabled, that dropped to 2. Both were cases where the model summarized across two clauses rather than fabricating. That’s a 94% reduction in the failure mode I cared most about.

Two practical notes:

Citations require real text. A scanned image PDF with no text layer won’t help. I run OCR first for those.
Citations add tokens. On long docs with many citations, my output token count rose ~20%. I budget for it because the verification benefit is worth it.

Pattern 3: Multi-File Context Without the Mess

Real workflows aren’t one-file affairs. Think: an original contract, an amendment, and an email thread—then the user asks whether the amendment changes cancellation terms in the original. Inline, I used awkward concatenation with delimiter headers and hoped the model respected them.

With file IDs, I attach multiple document blocks in one message. Each keeps its identity, and citations point back to the correct source file. Claude can say “the original says 30 days (file A) but the amendment extends this to 60 days (file B).” That cross-file reasoning is what people pay for.

content = [
    {"type": "document", "source": {"type": "file", "file_id": original.id}, "citations": {"enabled": True}},
    {"type": "document", "source": {"type": "file", "file_id": amendment.id}, "citations": {"enabled": True}},
    {"type": "text", "text": "Does the amendment change the cancellation terms?"}
]

In practice, I cap each request at 8 files. Beyond that, I do a retrieval pass to pick what’s relevant, then attach only those. For a 40-document case file, sending all 40 every time is slow and expensive. I run a cheap embedding search to find the top 6, attach those, and let citations confirm the model used the right ones.

The numbers matter. A 3-file request with two long contracts and an email thread runs around 90KB of underlying document content. Sending file IDs keeps my request payload tiny while Claude still has full access to all three. Combined with caching (next pattern), follow-ups on that same set run in a fraction of the first-request latency.

If you’re building agents that juggle many documents across steps, the file-ID handoff between tools is the unlock. I go deeper on orchestration in Claude Agent SDK in Production, which pairs well with the file patterns here.

Pattern 4: Cache Reuse Across Requests

This is where latency wins compound. Anthropic’s prompt caching lets you mark a portion of the prompt as cacheable; subsequent requests that share that prefix read from cache instead of reprocessing it. When the cached portion is a large document, the savings are dramatic.

I attach a long document, add a cache control marker, and the first request processes the whole thing. On every follow-up against the same document within the cache window, the document tokens come from cache. On a 90KB contract, my first request took 14 seconds; the second question, hitting cache, came back in 3 seconds. That 11-second gap is the difference between sluggish and instant.

content = [{
    "type": "document",
    "source": {"type": "file", "file_id": contract.id},
    "cache_control": {"type": "ephemeral"}
}]

Tips that kept my hit rate high:

Keep document blocks at the front of the content array and the question at the end; the cache matches a shared prefix.
The cache matches on exact prefix. If you reorder document blocks, you’ll miss. I sort attached files deterministically (by file ID) before building the request. That one line lifted my cache hit rate above 80% (versus ~50% when ordering varied).
Cache is keyed on content, not user. Two users asking about the same public document can share a cache hit. For private documents, file IDs differ per upload, so there’s no cross-user leakage through cache. I verified this before shipping.

Pattern 5: Cleanup So Storage Does Not Rot

Uploaded files persist until you delete them. That’s great for reuse and terrible if you ignore it. In month one, I uploaded 1,400 files and deleted zero. That’s a mess waiting to become a problem.

I run a nightly cron with three steps: (1) read the SQLite mapping and find file IDs whose TTL has passed (default 7 days), (2) call the delete endpoint for each, (3) remove the row so local state and Anthropic’s state stay in sync.

expired = db.query("SELECT file_id FROM files WHERE uploaded < datetime('now', '-7 days')")
for row in expired:
    client.beta.files.delete(row["file_id"])
    db.delete_file(row["file_id"])

I also reconcile weekly: list all files from the API and compare against my table. Any file the API knows about that my table doesn’t is an orphan (often from a crashed upload). I delete those too. After I added reconciliation, the file count stabilized around 90–120 active files instead of climbing.

The 7-day default came from real usage. Most users finish with a document within a day, but some return midweek. Seven days covers the long tail without hoarding. For documents tied to a paid case a user might revisit, I bump TTL to 30 days and flag them so cleanup skips them—one boolean column in the same table. The result: storage stays flat week over week because cleanup is automatic.

Bottom Line

The Files API changed how I build document workflows in five concrete ways: upload once and reference by ID, ground every answer with real citations, attach multiple files for cross-document reasoning, cache long documents for instant follow-ups, and automate cleanup so storage never rots. The combined effect: follow-up latency dropped from 14 seconds to 3, fabricated quotes fell 94%, and the file count holds steady instead of growing forever.

None of these patterns are hard on their own. The value comes from running all five together because they reinforce each other. File IDs make caching possible, caching makes multi-file requests affordable, and cleanup keeps the whole thing sustainable. If you’re starting from inline text blobs, migrate the upload pattern first, then layer in citations, then cache, then cleanup last.

For the full request scaffolding and how these pieces fit into a larger agent loop, the Claude Blueprint walks through the setup end to end. Build the small version first, measure your own latency, and add patterns as your document volume grows.

Reference: View article

Tag: Claude Files API