Every ERP vendor shipped an “AI copilot” in the last twelve months. Most of them are a chat widget that sends your general ledger to OpenAI and hopes for the best. That wasn't going to work for us — our median customer has revenue they consider private, payroll they consider sensitive, and legal teams who consider “shared with a third-party model” a dealbreaker.
So Billin AI had three hard constraints before we wrote a single line of inference code:
- No tenant data ever leaves our infrastructure — not even pseudonymized, not even for fine-tuning.
- Every answer is grounded in the user's own tenant, with row-level ACLs that respect the asking user's permissions.
- The whole thing runs on commodity hardware. We're not paying H100 prices per tenant.
This post is a walkthrough of what we built, what we discarded, and what we'd do differently if we started over.
The architecture in one diagram
At the top level, Billin AI is three layers: a retrieval layer that reads tenant data through ACLs, a model router that picks the smallest model that can answer the question, and a verifier that checks the answer against the source rows before anything leaves the system.
The router is the piece people usually get wrong. Most teams default to the biggest model they can afford. We default to the smallest one that can provably answer the question — and only escalate if the verifier rejects the result.
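The escalate-on-rejection loop is simple to sketch. Everything below is illustrative rather than Billin AI's actual API: the tier names and the `classify_intent` / `run_tier` / `verify` helpers are assumptions standing in for whatever the real router calls.

```python
from typing import Callable

# Tier names and helper signatures are illustrative, not Billin AI's real API.
TIERS = ["rules", "local-7b", "local-34b", "hosted-frontier"]

def route(question: str,
          classify_intent: Callable[[str], int],
          run_tier: Callable[[str, str], str],
          verify: Callable[[str, str], bool],
          frontier_opt_in: bool = False) -> str:
    """Start at the cheapest tier the classified intent allows, and
    escalate only when the verifier rejects the candidate answer."""
    start_tier = classify_intent(question)  # index of the smallest viable tier
    for tier in TIERS[start_tier:]:
        if tier == "hosted-frontier" and not frontier_opt_in:
            break  # hosted models only for tenants who explicitly opted in
        answer = run_tier(tier, question)
        if verify(question, answer):
            return answer
    raise RuntimeError("no tier produced a verifiable answer")
```

The point of the sketch is the control flow: cost is bounded below by the rule engine, and a bigger model is only ever consulted after a cheaper one has demonstrably failed verification.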
Retrieval with ACLs baked in
Row-level security in Postgres is older than most AI teams realize. Every query Billin AI issues goes through the same RLS policies a human user would — if a bookkeeper can't see the CEO's salary row, neither can the AI running on her behalf.
This sounds obvious but it's the single biggest mistake we saw other vendors make. If your retrieval layer bypasses RLS “because it's the AI,” you've built a leak.
```sql
-- every AI query runs as the asking user's role
SET LOCAL ROLE $asking_user_role;
SET LOCAL app.current_tenant = $tenant_id;

SELECT entry_id, debit, credit, memo
FROM ledger_entries
WHERE period = '2026-Q1'
  AND ai_redact(memo) = memo;  -- column-level redaction
```
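On the application side, the session has to be pinned before the AI's query runs. A minimal sketch of that wrapper, assuming the same `app.current_tenant` setting as above (the helper name and validation are ours, not Billin AI's real code): one detail worth knowing is that `SET` can't take a bind parameter in most drivers, so the tenant setting goes through `set_config(..., true)`, which is the parameterizable equivalent of `SET LOCAL`.

```python
def tenant_scoped_statements(role: str, tenant_id: str) -> list[tuple[str, tuple]]:
    """Statements to run, in order, inside one transaction before the AI's
    SELECT. SET LOCAL reverts at COMMIT, so nothing leaks across requests
    on a pooled connection."""
    # Roles come from our own catalog, never from user text -- but role names
    # can't be bind-parameterized, so refuse anything that isn't identifier-like.
    if not role.replace("_", "").isalnum():
        raise ValueError(f"suspicious role name: {role!r}")
    return [
        (f"SET LOCAL ROLE {role}", ()),
        # set_config(name, value, is_local=true) == SET LOCAL, but accepts
        # a bind parameter, unlike a plain SET statement.
        ("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,)),
    ]
```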
Why we run smaller local models for 80% of requests
We benchmarked 14 models over eight weeks on real accounting tasks — reconciliation, categorization, anomaly spotting, NL-to-SQL. The punchline: a 7B model fine-tuned on accounting vocabulary beats a 70B general model on 9 of the 12 tasks that make up the bulk of our traffic.
> The question is never “how good is the model.” It's “how good is the model on the task you actually have, grounded in the context you actually have.”
>
> — Internal eval memo, week 6
The router looks at the classified intent and picks:
- Tier 0 — rule engine. Reconciliation with obvious matches (same amount, same date, same reference). ~34% of all AI calls never touch a model.
- Tier 1 — local 7B. Categorization, anomaly flags, short NL-to-SQL. ~48% of calls.
- Tier 2 — local 34B. Multi-step reasoning, forecast narration, longer SQL. ~15% of calls.
- Tier 3 — hosted frontier. Only customers who've explicitly opted in, for the gnarliest cases. ~3% of calls.
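Tier 0 is worth dwelling on, because a third of traffic never pays for inference at all. A deliberately crude sketch of the "obvious match" rule — exact amount, same date, same normalized reference — where the `Entry` shape and field names are our assumptions, not the real schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Entry:
    amount_cents: int   # store money as integer cents, never floats
    posted: date
    reference: str

def obvious_match(bank: Entry, ledger: Entry) -> bool:
    """Tier 0 rule: exact amount, same posting date, same reference
    after trivial normalization. Anything fuzzier falls through to a
    model tier instead of being guessed at here."""
    return (bank.amount_cents == ledger.amount_cents
            and bank.posted == ledger.posted
            and bank.reference.strip().lower() == ledger.reference.strip().lower())
```

The design choice is that Tier 0 only ever says "match" when it cannot plausibly be wrong; precision is the whole point, and recall is what the model tiers are for.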
“Smallest model that can provably answer” — that's the single most important sentence in our architecture doc.
The evaluation harness
We have 4,200 golden examples, drawn from anonymized real tickets, that run on every deploy. If a model change moves accuracy on any single category by more than 0.8 percentage points, the deploy is blocked and a human signs off.
- 1,800 reconciliation pairs with labeled match/no-match/uncertain outcomes
- 900 NL-to-SQL prompts with graded SQL outputs
- 700 narration prompts with factual-accuracy scores from a finance reviewer
- 400 anomaly-detection cases with human ground truth
- 400 adversarial prompts designed to elicit hallucinated numbers
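The gating rule itself is a few lines. This sketch assumes per-category accuracy is reported in percentage points; the function name and the numbers in the example are illustrative, only the 0.8-point threshold comes from the text.

```python
def deploy_blockers(baseline: dict[str, float],
                    candidate: dict[str, float],
                    threshold_pp: float = 0.8) -> list[str]:
    """Return every eval category whose accuracy moved by more than
    threshold_pp percentage points in either direction. A non-empty
    list blocks the deploy until a human signs off -- regressions and
    suspicious improvements both warrant a look."""
    return [cat for cat in baseline
            if abs(candidate[cat] - baseline[cat]) > threshold_pp]
```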
What this looks like in production
- 2.1M inference calls across 1,240 tenants
- 78% completed in under 50ms (Tier 0 + Tier 1)
- 3.8% rejected by the verifier and re-run or escalated
- Zero data-egress incidents — we log every byte leaving our VPC
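The verifier behind that 3.8% rejection rate is, at its core, a grounding check: no number should leave the system unless it appears in the rows retrieval actually returned. A deliberately crude sketch of that idea — the function name, row shape, and number-extraction regex are all our assumptions:

```python
import re

def grounded(answer: str, source_rows: list[dict]) -> bool:
    """Reject any answer containing a number that does not appear in the
    retrieved rows. Real verification needs more (units, derived sums,
    date formats); this shows only the core containment check."""
    allowed = {str(v) for row in source_rows for v in row.values()}
    # Strip thousands separators, then pull out integer/decimal literals.
    numbers = re.findall(r"\d+(?:\.\d+)?", answer.replace(",", ""))
    return all(n in allowed for n in numbers)
```

A rejection here sends the request back through the router one tier up, which is exactly the escalation trigger described earlier.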
What we'd redo
1. We under-invested in the eval harness early. For the first three months we were shipping model changes on vibes. The harness was two weekends of engineering time that would have saved us two months of customer complaints.
2. We started with vector search everywhere. It turns out the majority of our queries are better served by deterministic SQL with a small amount of natural-language slot-filling.
3. We should have shipped the verifier before the model. The verifier is the thing that makes an LLM safe enough to put in front of accountants. Everything else is just plumbing around it.
If this kind of thing is your flavor of fun, we're hiring two more people on the AI platform team. Come build the part of the product that has to get every number right.