The first time our automated AI system flagged a customer notice as non-compliant, everyone got excited. We had been grinding at this for months. Two runs later, it flagged the exact same notice as compliant.
Nobody said anything for a second. That moment probably taught me more about deploying AI in financial services than anything else.
Regulatory compliance in financial services is document-heavy, time-intensive, and rarely straightforward.
For years, the answer was headcount: more compliance advisors, more reviewers, more manual checks. Then Generative AI arrived and everyone wanted to move fast (our CEO and CTO want everything to be AI now).
Having led a team doing exactly this work at a major US financial institution, I want to offer something more useful than the usual enthusiasm: an honest account of what actually happens when you deploy GenAI for compliance in production.
In short: it works, it is worth doing, but it will take longer and cost more than your first estimate. Sorry.
The Tool We Used and Why
Our team used Meta’s Llama Scout, a model that combines text and vision in a single architecture, to process financial documents. The vision side was particularly important for our use case because compliance documents in financial services are not always clean, structured text files.
They include PDFs with variable formatting, scanned notices, letters with embedded tables, and documents that can look very different from one another even when they serve the same regulatory purpose.
Llama Scout’s ability to handle both text extraction and visual document understanding in a single model simplified our architecture considerably. We did not need separate pipelines for typed versus scanned content, which matters when you are operating at enterprise scale with thousands of documents flowing through a system.
Besides the reasons above, it was also the only model approved internally for GenAI experimentation.
That said, choosing the right model is only the beginning of the problem.
How We Actually Built It
Since people always ask: no, we did not use RAG. While it is often the default approach for LLM-based document workflows, our use case did not require a retrieval layer. The regulatory requirements we were checking against were not a giant corpus of unknowns. They were specific, known, well-defined attributes. So we took a different approach.
We broke each regulatory requirement down into a discrete, testable question. Something like: does this adverse action letter contain a FICO score? That single question maps to a specific regulatory requirement. We wrote a prompt for it. Then another question for the next requirement. Then another. Each prompt was scoped to one clear intent rather than asking the model to evaluate compliance broadly. Broad questions get broad answers. In compliance, that is not useful.
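As a rough sketch of that decomposition (the requirement IDs and question wording below are illustrative placeholders, not our actual regulatory checklist):

```python
# Each regulatory requirement becomes one narrowly scoped, testable question.
# These entries are illustrative, not a real compliance checklist.
REQUIREMENT_CHECKS = {
    "adverse_action_fico_score": (
        "Does this adverse action letter contain a FICO score?"
    ),
    "adverse_action_reason_codes": (
        "Does this letter list the principal reasons the action was taken?"
    ),
}

def build_prompt(requirement_id: str, document_text: str) -> str:
    """Render a single-intent prompt: one requirement, one question."""
    question = REQUIREMENT_CHECKS[requirement_id]
    return (
        "You are a compliance document reviewer.\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}\n"
        "Answer yes or no, then give your justification and the exact "
        "supporting text from the document."
    )
```

Each requirement gets its own prompt call, so a failure on one check never muddies the answer to another.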
[Architecture diagram]
This approach meant more prompts to manage but much more reliable outputs. The model is not being asked to reason across ten requirements at once. It is being asked one specific thing and expected to answer it with evidence. That structure also proved important for evaluation.
On the document side, we converted everything to images first, including PDFs. Non-image documents were converted and then compressed to keep file sizes manageable without losing the detail the model needed to read.
Each page was treated as a separate image. We extracted text from each page individually and then combined the full page sequence to give the model complete document context before asking it anything. That pipeline handled much of the variability problem because we stopped fighting document formats and instead converted everything into a consistent format.
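The assembly step can be sketched as a pure function over extracted pages (the upstream image conversion and compression are summarized in comments; names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int   # 1-based page position in the source document
    text: str     # text extracted from that page's image

def assemble_context(pages: list[Page]) -> str:
    """Combine per-page extractions, in order, into one document context.

    Upstream (not shown): every document, PDFs included, is rendered to
    one image per page and compressed, so the model always sees a single
    consistent format regardless of the source.
    """
    ordered = sorted(pages, key=lambda p: p.number)
    return "\n\n".join(f"[Page {p.number}]\n{p.text}" for p in ordered)
```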
For evaluation, we measured model output against what a human reviewer would produce, including their justification and the specific evidence they cited. We were not only measuring whether the answer was correct, but whether the reasoning aligned.
A model that gets the right answer for the wrong reason is a problem waiting to happen in a regulated environment. We caught several of those cases early, and it changed how we thought about prompt design going forward.
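A minimal sketch of that two-part check, assuming evidence is compared by normalized substring match (a simplification; any overlap metric could stand in here):

```python
from dataclasses import dataclass

@dataclass
class Review:
    verdict: str    # e.g. "yes" / "no" for a single requirement check
    evidence: str   # the exact document text cited in support

def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def agrees(model: Review, human: Review) -> dict:
    """Score both the answer and the reasoning behind it."""
    verdict_ok = model.verdict == human.verdict
    # Right answer with the wrong evidence is still flagged: in a
    # regulated environment that combination is a problem waiting to happen.
    evidence_ok = (_norm(human.evidence) in _norm(model.evidence)
                   or _norm(model.evidence) in _norm(human.evidence))
    return {"verdict": verdict_ok, "evidence": evidence_ok,
            "aligned": verdict_ok and evidence_ok}
```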
Lesson One: Context Is Everything
Here is the thing nobody tells you early enough: context is basically everything. A model that does not understand the regulatory framework it is working in will give you answers that sound completely reasonable and are completely wrong. Fluent nonsense, basically.
In practice, that means doing a lot of prompt engineering and building your context layer before you touch production code. Your model needs to know not just what the document says, but what it should say, under which state law, and under which version of that law.
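One way to picture the context layer is a lookup keyed by jurisdiction and law version (the state codes, version dates, and rule text below are all placeholders, not real regulatory content):

```python
# The context layer: what the document SHOULD say, keyed by state and by
# which version of the law applies. All entries are fabricated examples.
CONTEXT = {
    ("CA", "2023-01"): "Notice must disclose the specific score used ...",
    ("CA", "2024-07"): "Amended: notice must also disclose the score range ...",
    ("NY", "2023-01"): "Notice must identify the consumer reporting agency ...",
}

def requirement_for(state: str, effective: str) -> str:
    """Pick the newest rule version in effect as of the notice date."""
    versions = [v for (s, v) in CONTEXT if s == state and v <= effective]
    if not versions:
        raise KeyError(f"No rule context for {state} as of {effective}")
    return CONTEXT[(state, max(versions))]
```

The point is not the data structure; it is that someone who knows the regulations has to populate and maintain it.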
We spent far more time on the context layer than on the model itself. Honestly, more than I expected going in. There is an instinct to treat model selection as the hard problem. It is not.
The hard problem is making sure the model knows enough about your specific regulatory world to be useful, and that requires compliance experts in the room, not just engineers.
Here is an example of the glamorous work my team and I did:
System: You are a compliance document reviewer for financial services regulatory requirements.
Document: [Full document text]
Question: Does this adverse action letter contain a FICO score as required by [Regulation X]?
Answer yes or no, then provide:
- Your justification
- The specific text evidence from the document
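A prompt like the one above only pays off if the response comes back in a form you can check. A hedged sketch of the parsing side, assuming the model follows the requested answer format (the field labels are assumptions, and anything that fails to parse goes to a human):

```python
import re

def parse_answer(raw: str) -> dict:
    """Parse a structured model response into verdict, justification,
    and cited evidence. Malformed responses are never guessed at; they
    are routed to a human reviewer instead."""
    verdict = re.match(r"\s*(yes|no)\b", raw, re.IGNORECASE)
    just = re.search(r"justification:\s*(.+)", raw, re.IGNORECASE)
    evid = re.search(r"evidence:\s*(.+)", raw, re.IGNORECASE)
    if not (verdict and just and evid):
        return {"status": "needs_human_review", "raw": raw}
    return {
        "status": "parsed",
        "verdict": verdict.group(1).lower(),
        "justification": just.group(1).strip(),
        "evidence": evid.group(1).strip(),
    }
```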
Lesson Two: Document Variability Will Break Your First Pipeline
In a controlled demo, your GenAI system will perform beautifully. Feed it a clean, well-formatted document and it will extract the relevant clauses accurately. Then you move to production (you know what’s coming, right?).
Production documents come from dozens of sources, in dozens of formats. Some are scanned at an angle. Some have handwritten annotations. Some have tables nested inside tables. A model that handles 95% of documents correctly in testing will encounter the difficult 5% continuously in production, because volume is high and edge cases accumulate quickly. We learned this the hard way and then spent significant time rebuilding our preprocessing pipeline to handle what the real world actually sends you.
The practical solution is building stronger preprocessing into your pipeline and investing in evaluation infrastructure early. You need to know, for every document that flows through your system, whether the model’s output was correct.
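A small piece of that evaluation infrastructure, sketched with illustrative format labels: tracking correctness per document format, so the difficult 5% shows up as its own number instead of hiding inside a global average.

```python
from collections import defaultdict

def accuracy_by_format(results: list[dict]) -> dict:
    """Aggregate per-document correctness by source format. Skewed scans,
    nested tables, and annotated pages get their own accuracy figures
    rather than being averaged away by the clean documents."""
    totals = defaultdict(lambda: [0, 0])  # format -> [correct, seen]
    for r in results:
        bucket = totals[r["format"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {fmt: correct / seen for fmt, (correct, seen) in totals.items()}
```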
Lesson Three: You Need Ground Truth Before You Need a Model
Nobody talks enough about evaluation. How do you actually know if the model is right? In most applications, that question is annoying but manageable. In compliance, it matters legally. You cannot just eyeball a sample and move on.
We had to build ground truth before we could run real experiments. A set of documents with known correct answers, verified by actual compliance experts. It took time (and I hated that). It was not the kind of work that makes anyone’s highlight reel. But without it, we had no way to tell whether the model was getting better or just getting more confident while being wrong. Those two things can look identical from the outside.
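Ground truth is what makes "more confident while being wrong" detectable at all. A toy sketch of the idea, assuming the model reports a confidence score with each answer (a simplification of real calibration analysis):

```python
def calibration_gap(evals: list[dict]) -> float:
    """Compare the model's average stated confidence to its actual
    accuracy against expert-labeled ground truth. A large positive gap
    means the model is getting more confident, not more correct."""
    if not evals:
        raise ValueError("need at least one ground-truth example")
    accuracy = sum(e["predicted"] == e["ground_truth"] for e in evals) / len(evals)
    mean_conf = sum(e["confidence"] for e in evals) / len(evals)
    return mean_conf - accuracy
```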
Budget for this. Seriously. It is not optional.
And honestly, the same thing I said about context earlier applies here too. The work that does not feel like AI work is where most of your time actually goes: building ground truth, wrangling data approvals, writing evaluation scripts. Not the model. The stuff around the model.
Lesson Four: Data Security Approvals Will Take Longer Than You Expect
This one catches nearly every team off guard. In a regulated financial institution, you cannot simply connect a model to production data and start experimenting. Every data access request goes through a formal approval process. In our case, that meant PSA (Privacy and Security Assessment) approvals before we could work with any data that touched real customer or regulatory information.
I will be straight with you: the approval process was genuinely frustrating. We had engineers ready to go, and we were waiting weeks. But here is the thing. These controls exist because financial data is sensitive in ways that actually matter to real people. When I stopped treating the approvals as obstacles and started treating them as part of the job, the whole thing got easier. Your timeline needs to include them. Not as a risk item. As a given.
We worked with synthetic and anonymized data during early development and built our roadmap around the reality that production data access takes time. If you are leading an enterprise AI initiative in a regulated industry and your timeline does not account for security approval cycles, your timeline is wrong.
The Biggest Misconception: GenAI Cannot Do Everything
The biggest mistake I see is people expecting full autonomy. The pitch sounds clean: feed the model the state laws, connect it to your document systems, let it tell you what is compliant. Done. No more compliance reviewers sitting in a room reading notices all day. Sounds like a fairy tale, right?
Reality is messier. Take a simple case. A customer notice has certain wording. A state law requires disclosure in a specific way. A compliance expert with ten years of experience reads both and makes a call, drawing on case history, regulatory context, and honestly some gut instinct built up over years. That judgment is not something you can just prompt your way to.
What the model can do is flag issues, surface relevant regulatory text, and cut down the volume of documents a human has to read cover to cover. That is genuinely valuable and nothing to dismiss. But making the final call on an ambiguous case? Not there yet. Maybe not ever for certain categories of decision.
Teams that go in expecting full autonomy either end up with a system that makes wrong calls nobody catches, or they add so many guardrails that the whole thing slows down to roughly the speed of the manual process it was supposed to replace. Neither outcome is good.
What Actually Changes When You Get This Right
When it goes right, though, the impact is real. Not hype-real. Actually real, and everyone loves it.
Reviewers stop reading every document and start focusing on flagged cases. Turnaround drops. There are other benefits too, consistency being a big one. Honestly, I could write a whole other piece on just the operational side of what changes. Maybe I will at some point.
The business case is there. I have seen it.
The organizations that get the most out of this are not the ones that ship fastest. They are the ones that do the boring foundational work first and then ship something that actually holds up.
Move carefully. It is faster in the long run.
__
Join us in New York on June 9 at CDAO New York to explore these themes and more, alongside senior leaders shaping the future of data, analytics, and AI.