Every ERP vendor shipped an “AI copilot” in the last twelve months. Most of them are a chat widget that sends your general ledger to OpenAI and hopes for the best. That wasn't going to work for us — our median customer has revenue they consider private, payroll they consider sensitive, and legal teams who consider “shared with a third-party model” a dealbreaker.
So Billin AI had three hard constraints before we wrote a single line of inference code:
- No tenant data ever leaves our infrastructure — not even pseudonymized, not even for fine-tuning.
- Every answer is grounded in the user's own tenant, with row-level ACLs that respect the asking user's permissions.
- The whole thing runs on commodity hardware. We're not paying H100 prices per tenant.
This post is a walkthrough of what we built, what we discarded, and what we'd do differently if we started over.
The architecture in one diagram
At the top level, Billin AI is three layers: a retrieval layer that reads tenant data through ACLs, a model router that picks the smallest model that can answer the question, and a verifier that checks the answer against the source rows before anything leaves the system.
The router is the piece people usually get wrong. Most teams default to the biggest model they can afford. We default to the smallest one that can provably answer the question — and only escalate if the verifier rejects the result.
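The escalate-on-rejection loop is simple to sketch. Everything below is illustrative rather than Billin AI's actual API: the tier names and the `classify_intent` / `run_tier` / `verify` helpers are assumptions standing in for whatever the real router calls.

```python
from typing import Callable

# Tier names and helper signatures are illustrative, not Billin AI's real API.
TIERS = ["rules", "local-7b", "local-34b", "hosted-frontier"]

def route(question: str,
          classify_intent: Callable[[str], int],
          run_tier: Callable[[str, str], str],
          verify: Callable[[str, str], bool],
          frontier_opt_in: bool = False) -> str:
    """Start at the cheapest tier the classified intent allows, and
    escalate only when the verifier rejects the candidate answer."""
    start_tier = classify_intent(question)  # index of the smallest viable tier
    for tier in TIERS[start_tier:]:
        if tier == "hosted-frontier" and not frontier_opt_in:
            break  # hosted models only for tenants who explicitly opted in
        answer = run_tier(tier, question)
        if verify(question, answer):
            return answer
    raise RuntimeError("no tier produced a verifiable answer")
```

The point of the sketch is the control flow: cost is bounded below by the rule engine, and a bigger model is only ever consulted after a cheaper one has demonstrably failed verification.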
Retrieval with ACLs baked in
Row-level security in Postgres is older than most AI teams realize. Every query Billin AI issues goes through the same RLS policies a human user would — if a bookkeeper can't see the CEO's salary row, neither can the AI running on her behalf.
This sounds obvious but it's the single biggest mistake we saw other vendors make. If your retrieval layer bypasses RLS “because it's the AI,” you've built a leak.
```sql
-- every AI query runs as the asking user's role
SET LOCAL ROLE $asking_user_role;
SET LOCAL app.current_tenant = $tenant_id;

SELECT entry_id, debit, credit, memo
FROM ledger_entries
WHERE period = '2026-Q1'
  AND ai_redact(memo) = memo;  -- column-level redaction
```
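On the application side, the session has to be pinned before the AI's query runs. A minimal sketch of that wrapper, assuming the same `app.current_tenant` setting as above (the helper name and validation are ours, not Billin AI's real code): one detail worth knowing is that `SET` can't take a bind parameter in most drivers, so the tenant setting goes through `set_config(..., true)`, which is the parameterizable equivalent of `SET LOCAL`.

```python
def tenant_scoped_statements(role: str, tenant_id: str) -> list[tuple[str, tuple]]:
    """Statements to run, in order, inside one transaction before the AI's
    SELECT. SET LOCAL reverts at COMMIT, so nothing leaks across requests
    on a pooled connection."""
    # Roles come from our own catalog, never from user text -- but role names
    # can't be bind-parameterized, so refuse anything that isn't identifier-like.
    if not role.replace("_", "").isalnum():
        raise ValueError(f"suspicious role name: {role!r}")
    return [
        (f"SET LOCAL ROLE {role}", ()),
        # set_config(name, value, is_local=true) == SET LOCAL, but accepts
        # a bind parameter, unlike a plain SET statement.
        ("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,)),
    ]
```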
Why we run smaller local models for 80% of requests
We benchmarked 14 models over eight weeks on real accounting tasks — reconciliation, categorization, anomaly spotting, NL-to-SQL. The punchline: a 7B model fine-tuned on accounting vocabulary beats a 70B general model on 9 of the 12 tasks that make up the bulk of our traffic.
> The question is never “how good is the model.” It's “how good is the model on the task you actually have, grounded in the context you actually have.”
>
> — Internal eval memo, week 6
The router looks at the classified intent and picks:
- Tier 0 — rule engine. Reconciliation with obvious matches (same amount, same date, same reference). ~34% of all AI calls never touch a model.
- Tier 1 — local 7B. Categorization, anomaly flags, short NL-to-SQL. ~48% of calls.
- Tier 2 — local 34B. Multi-step reasoning, forecast narration, longer SQL. ~15% of calls.
- Tier 3 — hosted frontier. Only customers who've explicitly opted in, for the gnarliest cases. ~3% of calls.
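Tier 0 is worth dwelling on, because a third of traffic never pays for inference at all. A deliberately crude sketch of the "obvious match" rule — exact amount, same date, same normalized reference — where the `Entry` shape and field names are our assumptions, not the real schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Entry:
    amount_cents: int   # store money as integer cents, never floats
    posted: date
    reference: str

def obvious_match(bank: Entry, ledger: Entry) -> bool:
    """Tier 0 rule: exact amount, same posting date, same reference
    after trivial normalization. Anything fuzzier falls through to a
    model tier instead of being guessed at here."""
    return (bank.amount_cents == ledger.amount_cents
            and bank.posted == ledger.posted
            and bank.reference.strip().lower() == ledger.reference.strip().lower())
```

The design choice is that Tier 0 only ever says "match" when it cannot plausibly be wrong; precision is the whole point, and recall is what the model tiers are for.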
“Smallest model that can provably answer” — that's the single most important sentence in our architecture doc.
The evaluation harness
We have 4,200 golden examples, drawn from anonymized real tickets, that run on every deploy. If a model change moves accuracy on any single category by more than 0.8 percentage points, the deploy is blocked and a human signs off.
- 1,800 reconciliation pairs with labeled match/no-match/uncertain outcomes
- 900 NL-to-SQL prompts with graded SQL outputs
- 700 narration prompts with factual-accuracy scores from a finance reviewer
- 400 anomaly-detection cases with human ground truth
- 400 adversarial prompts designed to elicit hallucinated numbers
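The gating rule itself is a few lines. This sketch assumes per-category accuracy is reported in percentage points; the function name and the numbers in the example are illustrative, only the 0.8-point threshold comes from the text.

```python
def deploy_blockers(baseline: dict[str, float],
                    candidate: dict[str, float],
                    threshold_pp: float = 0.8) -> list[str]:
    """Return every eval category whose accuracy moved by more than
    threshold_pp percentage points in either direction. A non-empty
    list blocks the deploy until a human signs off -- regressions and
    suspicious improvements both warrant a look."""
    return [cat for cat in baseline
            if abs(candidate[cat] - baseline[cat]) > threshold_pp]
```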
What this looks like in production
- 2.1M inference calls across 1,240 tenants
- 78% completed in under 50ms (Tier 0 + Tier 1)
- 3.8% rejected by the verifier and re-run or escalated
- Zero data-egress incidents — we log every byte leaving our VPC
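The verifier behind that 3.8% rejection rate is, at its core, a grounding check: no number should leave the system unless it appears in the rows retrieval actually returned. A deliberately crude sketch of that idea — the function name, row shape, and number-extraction regex are all our assumptions:

```python
import re

def grounded(answer: str, source_rows: list[dict]) -> bool:
    """Reject any answer containing a number that does not appear in the
    retrieved rows. Real verification needs more (units, derived sums,
    date formats); this shows only the core containment check."""
    allowed = {str(v) for row in source_rows for v in row.values()}
    # Strip thousands separators, then pull out integer/decimal literals.
    numbers = re.findall(r"\d+(?:\.\d+)?", answer.replace(",", ""))
    return all(n in allowed for n in numbers)
```

A rejection here sends the request back through the router one tier up, which is exactly the escalation trigger described earlier.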
What we'd redo
1. We under-invested in the eval harness early. For the first three months we were shipping model changes on vibes. The harness was two weekends of engineering time that would have saved us two months of customer complaints.
2. We started with vector search everywhere. It turns out the majority of our queries are better served by deterministic SQL with a small amount of natural-language slot-filling.
3. We should have shipped the verifier before the model. The verifier is the thing that makes an LLM safe enough to put in front of accountants. Everything else is just plumbing around it.
If this kind of thing is your flavor of fun, we're hiring two more people on the AI platform team. Come build the part of the product that has to get every number right.