
Building Billin AI: the retrieval architecture behind a tenant-grounded LLM.

We built an AI layer that works on every tenant's private data without sending a single row to a model provider. Here's the stack — row-level ACLs, column redaction, per-workspace embedding caches, and why we chose smaller local models for 80% of the workload.

Núria Roca
Principal engineer · AI platform
April 14, 2026
12 min read

Every ERP vendor shipped an “AI copilot” in the last twelve months. Most of them are chat widgets that send your general ledger to OpenAI and hope for the best. That wasn't going to work for us — our median customer has revenue they consider private, payroll they consider sensitive, and legal teams who consider “shared with a third-party model” a dealbreaker.

So Billin AI had three hard constraints before we wrote a single line of inference code:

  1. No tenant data ever leaves our infrastructure — not even pseudonymized, not even for fine-tuning.
  2. Every answer is grounded in the user's own tenant, with row-level ACLs that respect the asking user's permissions.
  3. The whole thing runs on commodity hardware. We're not paying H100 prices per tenant.

This post is a walkthrough of what we built, what we discarded, and what we'd do differently if we started over.

The architecture in one diagram

At the top level, Billin AI is three layers: a retrieval layer that reads tenant data through ACLs, a model router that picks the smallest model that can answer the question, and a verifier that checks the answer against the source rows before anything leaves the system.

[Diagram: the three layers and what lives in each]

  RETRIEVAL · row-level ACLs · column redaction · per-tenant embeddings · SQL planner · Postgres + pgvector
  ROUTER · intent classifier · cost estimator · model selection · fallback chain over 4 model sizes
  VERIFIER · source citation · hallucination check · number reconciliation · human-in-loop flags · rejects 3.8% of runs

The router is the piece people usually get wrong. Most teams default to the biggest model they can afford. We default to the smallest one that can provably answer the question — and only escalate if the verifier rejects the result.
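The escalation loop can be sketched in a few lines. Everything here is illustrative — the model names, the `run_model`/`verify` stubs, and the `Result` type are assumptions, not Billin's actual API; the only thing taken from the post is the shape of the policy: four sizes, smallest first, escalate only on verifier rejection.

```python
from dataclasses import dataclass

# Hypothetical model ladder: four sizes, smallest first (names are invented).
MODEL_LADDER = ["tiny-1b", "small-7b", "medium-13b", "large-70b"]

@dataclass
class Result:
    answer: str
    verified: bool

def run_model(model: str, question: str, context: str) -> str:
    # Stub for local inference against the tenant-grounded context.
    return f"[{model}] answer to: {question}"

def verify(answer: str, context: str) -> bool:
    # Stub for the verifier: citation check, number reconciliation, etc.
    return True

def answer_with_escalation(question: str, context: str) -> Result:
    """Start at the smallest model; escalate only when the verifier rejects."""
    for model in MODEL_LADDER:
        candidate = run_model(model, question, context)
        if verify(candidate, context):
            return Result(candidate, verified=True)
    # Every size failed verification: flag for a human instead of guessing.
    return Result("", verified=False)
```

The point of the shape is that cost control falls out of correctness checking: you never pay for the big model unless the small one demonstrably failed.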

Retrieval with ACLs baked in

Row-level security in Postgres is older than most AI teams know. Every query Billin AI issues goes through the same RLS policies a human user would — if a bookkeeper can't see the CEO's salary row, neither can the AI running on her behalf.

This sounds obvious but it's the single biggest mistake we saw other vendors make. If your retrieval layer bypasses RLS “because it's the AI,” you've built a leak.

-- every AI query runs as the asking user's role
SET LOCAL ROLE $asking_user_role;
SET LOCAL app.current_tenant = $tenant_id;

SELECT entry_id, debit, credit, memo
FROM   ledger_entries
WHERE  period = '2026-Q1'
AND    ai_redact(memo) = memo;  -- column-level redaction

Gotcha we hit on day 47: Postgres' statement-level triggers don't fire during RLS policy evaluation. If you're relying on triggers for audit logging of AI reads, you need row-level triggers or you'll miss 100% of them.

Why we run smaller local models for 80% of requests

We benchmarked 14 models over eight weeks on real accounting tasks — reconciliation, categorization, anomaly spotting, NL-to-SQL. The punchline: a 7B model fine-tuned on accounting vocabulary beats a 70B general model on 9 of the 12 tasks that make up the bulk of our traffic.

“The question is never ‘how good is the model.’ It's ‘how good is the model on the task you actually have, grounded in the context you actually have.’”

— Internal eval memo, week 6

The router looks at the classified intent and picks from the four model sizes in the fallback chain, starting with the smallest.

“Smallest model that can provably answer” — that's the single most important sentence in our architecture doc.

The evaluation harness

We have 4,200 golden examples, drawn from anonymized real tickets, that run on every deploy. If a model change moves accuracy on any single category by more than 0.8 percentage points, the deploy is blocked and a human signs off.

What this looks like in production

  P50 latency: 41 ms
  Avg cost per call: $0.003
  Rows sent to a 3rd party: 0

What we'd redo

1. We under-invested in the eval harness early. For the first three months we were shipping model changes on vibes. The harness was two weekends of engineering time that would have saved us two months of customer complaints.

2. We started with vector search everywhere. It turns out the majority of our queries are better served by deterministic SQL with a small amount of natural-language slot-filling.

3. We should have shipped the verifier before the model. The verifier is the thing that makes an LLM safe enough to put in front of accountants. Everything else is just plumbing around it.
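Point 2 above — deterministic SQL with natural-language slot-filling — can be sketched as: the model's only job is to classify an intent and extract named slots, and a fixed, parameterized template does the rest. The template name, slot names, and function are illustrative assumptions, not the actual Billin planner.

```python
# Hypothetical slot-filling planner: the LLM fills slots, never writes SQL.
# Slot values are bound as query parameters, never string-interpolated,
# so the model can neither inject SQL nor hallucinate a table name.
QUERY_TEMPLATES = {
    "quarter_ledger": (
        "SELECT entry_id, debit, credit, memo "
        "FROM ledger_entries WHERE period = %(period)s"
    ),
}

def build_query(intent: str, slots: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Map a classified intent plus extracted slots to a deterministic SQL call."""
    sql = QUERY_TEMPLATES[intent]  # unknown intent raises KeyError, never free-form SQL
    return sql, slots              # slots travel separately, as bind parameters

sql, params = build_query("quarter_ledger", {"period": "2026-Q1"})
```

The appeal over vector search for these queries is that the answer is exactly the rows RLS returns — there is nothing to hallucinate and nothing to re-rank.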


If this kind of thing is your flavor of fun, we're hiring two more people on the AI platform team. Come build the part of the product that has to get every number right.