Webinar recap · On demand

Under the Hood: what actually makes agentic AI reliable

Couldn't make it live, or just want to go back over it? This is the recap of Martin Lukac's session. He walks through what a language model really is, the four things you wrap around it to make it reliable, and why having a tool is not the same as getting an outcome. The full recording and the slide deck are both here.

Martin Lukac, CTO at Flank
Format: 45-min talk + live Q&A
Recorded: Wed 3 June 2026
Read time: ~8 minutes
In one minute

The short version

Everyone now has access to the same powerful models. So the model itself is no longer what separates a good legal AI product from a bad one. What separates them is everything you build around the model.

Martin started from first principles: what a large language model actually is (a statistical next-word predictor, nothing magical), and why that design means it optimises for plausibility rather than truth. From there he covered the four things you wrap around it (context, tools, guardrails, and human supervision) that turn a clever autocomplete into something you can trust across thousands of contracts. His honest punchline: reliability isn't a smarter model. It's the discipline of refusing to let the model decide what to skip.

The deck

Download the full presentation

Every slide from the session: the next-word example, the capability ladder, the four wrappers, and the redundant-review diagram. Handy for briefing your IT, procurement, or board.

Download PDF
Watch

The full session

Martin's 45-minute talk, plus the live Q&A with Martin and Lily at the end. If you'd rather skim, the written recap below covers all of it.

The frame

Adoption is everywhere. Impact isn't.

The defining problem for in-house legal in 2026 isn't whether to adopt AI. It's why adoption hasn't moved the outside-counsel line.

87% of GCs report using AI, up from 44% last year
7% say it's actually reduced their outside-counsel spend
90% of whether a legal AI product is good comes down to the wrappers, not the model

The gap between those first two numbers is the whole talk. Everything below is about closing it.

Part 1 · How AI actually works

What a large language model actually is

A large language model isn't anything magical. It's a statistical model of language, and at runtime all it does is predict the next word, one at a time, in a loop. That really is the core trick. Everything else is engineering you build around it.

It learned to do this by reading an internet-scale pile of text: books, code, contracts, the web. The bigger and better that pile, the better the model gets. So it isn't looking an answer up anywhere. It's generating the most plausible continuation of whatever you gave it.

It's predicting one word at a time

"This agreement shall be governed by the laws of "

England: 61%
New York: 23%
the State: 9%

The model doesn't pick one word. It produces a whole distribution of plausible next words and samples from it. With nothing else to go on, both "England" and "New York" are reasonable, so it's really just guessing. (And technically it predicts tokens, which are sub-word chunks, rather than whole words.)
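To make the sampling step concrete, here is a minimal sketch using the three probabilities from the slide. The `<other>` bucket is an assumption added so the weights sum to 1, and the function names are illustrative only. A real model scores tens of thousands of tokens at every step; this toy version only shows the difference between always taking the top guess and drawing from the distribution.

```python
import random
from collections import Counter

# Probabilities from the slide. "<other>" is a hypothetical bucket for the
# remaining 7% so that the weights sum to 1.
NEXT_TOKEN_PROBS = {
    "England": 0.61,
    "New York": 0.23,
    "the State": 0.09,
    "<other>": 0.07,
}


def greedy_next(probs):
    """Always return the single most likely continuation."""
    return max(probs, key=probs.get)


def sample_next(probs, rng):
    """Draw one continuation in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Seeded generator so the experiment is repeatable.
rng = random.Random(0)
draws = Counter(sample_next(NEXT_TOKEN_PROBS, rng) for _ in range(10_000))
```

Over many draws, "England" comes up roughly 61% of the time, which also means roughly 39% of runs pick something else. That is the run-to-run variability the rest of the talk is about containing.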

Part 1 · How AI actually works

How it learned, in two phases

01 · Pre-training
Read everything
It reads everything and adjusts billions of weights until its predictions match what humans actually wrote. This is the expensive part, hundreds of millions of dollars, which is why it's so hard to enter this business as a newcomer. What you get out is a "base model." It isn't learning facts here. It's learning word probabilities.
02 · Post-training
Teach it to be useful
This is where the model gets taught to be useful: follow instructions, decline harmful requests, take on a personality. Humans rate its answers and it gets nudged toward the good ones. It's also where reasoning and tool use get baked in, and where most of the recent capability jumps have come from, not from simply building bigger models.
The point worth remembering

The leaps you've seen lately come from that second phase, the training that gives a model its judgment and its skills, not from making the model bigger.

Part 1 · How AI actually works

Three consequences you can't ignore

Because of the way it's trained, every model has three hard limits baked in. None of these are bugs. They're just properties of how the thing works.

No real-time knowledge
It was trained on a snapshot of the world. Ask it about yesterday and it either guesses or admits it doesn't know. Getting a model to update its own weights from the live world is one of the hardest open problems in the field.
No database lookup
It doesn't retrieve facts, it reconstructs them from patterns. Think of petabytes of text compressed down into a much smaller model. That compression is lossy, and that loss is where hallucination comes from.
No guarantee of truth
It optimises for plausibility, not correctness. People call it a "stochastic parrot": it's very good at sounding right, which is not the same as being right. In legal reasoning, that distinction is everything.
Martin's line

Truth is a problem you solve at the system layer, not at the model layer.

Part 1 · How AI actually works

How you make it reliable: the four wrappers

A raw model is just the engine. These four wrappers are what turn it into something you can depend on, and they're where almost all of the real difference lives.

Wrapper 01
Context
Feed it the right documents while it works: your playbooks, your precedents, the clause in question. It's the easiest and biggest lever you have. The industry calls this RAG, or retrieval-augmented generation.
Wrapper 02
Tools
Let it search the web, query systems, send mail, update records. The moment a model can take actions out in the world is the moment a chatbot turns into an agent.
Wrapper 03
Guardrails
Hard rules, plus a separate system watching every exchange. Off-limits topics, escalation thresholds, and what's allowed to ship without a human ever seeing it.
Wrapper 04
Human in the loop
Send a person only the outputs that need one (the exceptions and the flagged risks), not every single result. The model does the work where it's confident and hands off where it isn't.
The headline number

About 90% of whether a legal AI product is any good comes down to these four wrappers, not the underlying model.
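As a rough illustration of how three of the wrappers fit around a model call, here is a minimal sketch. Everything in it is hypothetical: `fake_model` stands in for a real model, the `confidence` score, the blocked-topic list, and the 0.80 threshold are assumptions chosen for the example, and the tools wrapper is left out to keep it short. It is not how any particular product implements this.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical score in [0, 1]


def fake_model(prompt: str) -> Draft:
    """Stand-in for a real model call; returns a canned, low-confidence draft."""
    return Draft("Clause 9 caps liability at 12 months of fees.", 0.62)


BLOCKED_TOPICS = ("litigation strategy",)  # hypothetical guardrail rule
ESCALATION_THRESHOLD = 0.80                # hypothetical escalation threshold


def build_prompt(question, passages):
    """Wrapper 01, context: put the relevant playbook text in front of the model."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Playbook excerpts:\n{context}\n\nQuestion: {question}"


def violates_guardrails(text):
    """Wrapper 03, guardrails: a hard rule checked outside the model."""
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)


def handle(question, passages, model=fake_model):
    """Run one request through context, guardrails, and human routing."""
    if violates_guardrails(question):
        return ("refused", None)
    draft = model(build_prompt(question, passages))
    # Wrapper 04, human in the loop: only exceptions reach a person.
    if violates_guardrails(draft.answer) or draft.confidence < ESCALATION_THRESHOLD:
        return ("escalate_to_human", draft)
    return ("auto_ship", draft)
```

With the canned draft above, `handle("What is the liability cap?", [...])` routes to a human because 0.62 is below the threshold. The routing decision lives in ordinary code, not inside the model.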

Part 1 · How AI actually works

The capability ladder

The same model can sit at very different levels of autonomy, depending on how much of the work it's trusted to drive and where the human sits.

Rung 01
Chatbot
15% autonomous
You drive every turn. You ask, it answers. ChatGPT in a browser. Useful, but the returns flatten pretty fast.
Rung 02
Copilot
50% autonomous
Embedded where lawyers already work. It suggests, you execute. Every contract still passes through a lawyer who has to open the document.
Rung 03
Agent
90%, human on exceptions
You give it an outcome and it does the multi-step work, triggered by something like an arriving email and watched the whole time by guardrails. A lawyer supervises the exceptions, and the work actually leaves the queue.
Part 2 · Why models keep improving

Four drivers, none of them close to exhausted

Releases are coming faster, not slower. Four levers explain why, and the one that matters most for legal is also the one improving fastest.

Driver 01
Scale
More data, more parameters, more compute. It still pays off, though the returns are starting to flatten. When they first did, people declared the end of AI. They were wrong, and the next box is why.
Driver 02
Reasoning
Give the model room to think before it answers. This is "chain of thought," and it's where most of last year's gains came from. Martin's favourite example: ask an early model to write a detective story and only reveal the murderer in the last line, and it genuinely only works out who did it in the last line, because it never had a chance to think ahead.
Driver 03
Post-training
Teaching it to check its own reasoning, follow complex rules, and refuse cleanly. It's the most secretive part of the whole pipeline. Each lab guards it closely and very little gets published.
Driver 04
Scaffolding
Everything around the model: retrieval, tools, evaluation, and multi-agent flows where a big agent spawns smaller ones to do the work. This is improving fastest, and it's what matters most for legal.
The shift this creates

Everyone rents the same engines now. The question is no longer how good the model is. It's what you do with it.

Part 3 · A tool is not an outcome

Hand a contract to a general-purpose agent

Claude Cowork, the Claude plug-in for Word, even a tool tuned for legal: underneath, they're all the same shape. A powerful model, a handful of general tools (read, search, edit, comment), pointed at a document. And in a demo it genuinely works.

But once you actually look under the hood, the problems start to show:

What the general agent does
  • Burns tokens on everything around the document, not the review itself.
  • Reads non-systematically: it skims the front, jumps to the end, then hops between clauses.
  • No ledger of what it's already seen, so it can silently skip provisions.
  • Stochastic: run it three times and it can change its mind on whether a clause is compliant.
Why, and it's not a criticism
  • It's a general-purpose model. The same one writes code, designs slides, and reviews your contract.
  • It was never designed to do one thing exhaustively and identically every time.
  • The quirks that don't matter for a quick chat surface hard on complex, repeated legal work.
  • The fix isn't a better model. It's taking the wheel back at the system level.
Part 3 · A tool is not an outcome

Take the wheel back: system-directed review

It looks less impressive in a demo. But across hundreds of documents it's far more stable, and it gives the lawyer an outcome they can actually count on.

1
Decompose the contract
A .docx is broken down into a small, structured model of its clauses and sections. Nothing depends on the model deciding where to look.
2
The system decides what gets reviewed, not the model
Every clause, every section, no jumping around. Effort is deliberate: an indemnity cap is never brushed off, it's hard-coded to get proper attention.
3
Redundant review: check each provision three times
Three independent passes over the same clause against the same playbook, then reconciled. Much like a senior lawyer reading a tricky clause twice.
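The first two steps can be sketched in a few lines, under heavy simplifying assumptions. The sample contract, the "number at the start of a line" splitting heuristic, the `HIGH_EFFORT_TERMS` rule, and `stub_review` (a placeholder where a model call would go) are all invented for illustration. Real decomposition works on the document structure, not on plain text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    number: str
    text: str


SAMPLE_CONTRACT = """\
1. Term. This agreement runs for two years.
2. Indemnity. Supplier indemnifies Customer, capped at fees paid.
3. Governing law. The laws of England apply.
"""

# Hypothetical effort rule: clauses mentioning these terms always get deep review.
HIGH_EFFORT_TERMS = ("indemnity", "liability")


def decompose(contract_text):
    """Split plain text into numbered clauses (toy heuristic: 'N. ' at line start)."""
    clauses = []
    for line in contract_text.splitlines():
        match = re.match(r"^(\d+)\.\s+(.*)$", line)
        if match:
            clauses.append(Clause(number=match.group(1), text=match.group(2)))
    return clauses


def effort_for(clause):
    """Effort is decided by a fixed rule, not by the model."""
    lowered = clause.text.lower()
    return "deep" if any(term in lowered for term in HIGH_EFFORT_TERMS) else "standard"


def review_all(clauses, review_fn):
    """Walk every clause in order and record each result in a ledger."""
    ledger = {}
    for clause in clauses:
        result = review_fn(clause, effort_for(clause))
        if result is None:
            raise ValueError(f"clause {clause.number} returned no review")
        ledger[clause.number] = result
    return ledger


def stub_review(clause, effort):
    """Placeholder for a model call; only echoes the effort level."""
    return {"effort": effort, "finding": "not evaluated"}


ledger = review_all(decompose(SAMPLE_CONTRACT), stub_review)
```

The loop is the point. The model is only ever asked about one clause at a time, so it never gets the chance to skip one, and the ledger records what was looked at and how hard.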

What the three passes tell you

✓✓✓
All agree
Hallucinations wash out in aggregate. A made-up issue turns up in one pass but not the others, so the consensus is far more likely to be right.
They disagree
Often a sign of ambiguous wording in your playbook, not a model failure. A useful signal: fix the playbook, and flag this one for a human now.
No switching
If the answer is stable across runs, it's far more likely to be factually correct. If it switches, escalate to a human to be safe.
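One way to express that reconciliation rule in code is below. It is a sketch, and it assumes the strictest reading of the slide: only a unanimous result goes through automatically, and any disagreement is routed to a person. The verdict labels and the returned dictionary shape are hypothetical.

```python
from collections import Counter


def reconcile(verdicts):
    """Combine independent verdicts on one clause into a single routing decision."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    top_verdict, top_count = counts.most_common(1)[0]
    if top_count == len(verdicts):
        # All passes agree: accept the consensus.
        return {"verdict": top_verdict, "route": "auto"}
    # Any switching between passes: hand the clause to a human, with the votes.
    return {"verdict": None, "route": "human", "votes": dict(counts)}
```

For example, `reconcile(["compliant", "compliant", "non_compliant"])` routes to a human and carries the vote split with it, which is also the evidence you would want when checking whether the playbook wording is the real problem.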

Redundancy costs more to run. The cost is worth it: hallucinations shouldn't be a problem in this day and age.
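A back-of-the-envelope calculation shows why the extra passes pay for themselves. The 5% per-pass error rate below is a made-up figure, not one from the talk, and the arithmetic assumes the passes fail independently. That is a strong assumption: passes that share a model and a playbook can share mistakes, so treat the result as a best case.

```python
def spurious_rates(p, passes=3):
    """Chance a made-up issue survives, assuming independent passes.

    p is the hypothetical probability that one pass invents the issue.
    """
    if passes != 3:
        raise ValueError("the majority formula below is written for three passes")
    unanimous = p ** 3                       # all three passes invent it
    majority = 3 * p**2 * (1 - p) + p ** 3   # at least two of three invent it
    return {"single": p, "majority": majority, "unanimous": unanimous}


rates = spurious_rates(0.05)
```

Under those assumptions, a made-up issue that shows up in 1 of 20 single passes survives a two-of-three vote in well under 1 of 100 reviews, and survives a unanimity rule far less often still. The price is three model calls per clause.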

Part 4 · Closing the gap

Orchestration is only one quarter of it

The clever review logic above is only one piece of it. There are three more, and these are the ones a competitor can't copy, because they get built for your team over time.

01
Orchestration
Exhaustive, deliberate, redundant review. The system decides what gets looked at and how hard, instead of letting the model wander.
02
Legal engineering
Your playbooks, your positions, and the implicit knowledge sitting in your senior lawyers' heads, drawn out and made usable by a machine. This is the part no competitor can copy, because it's yours.
03
Implementation & integration
Wired into your DMS, your CLM, and your inbox, the way work actually arrives and leaves. A review that lives in a tool nobody opens is not an outcome.
04
Monitoring & supervision
Every run is watched, drift gets caught early, and only the right exceptions go to a human. Their judgment then feeds back in, so the system keeps improving without anyone retraining the model.
The structural point

The model is the easy quarter. Everyone rents the same one. The other three get built, for your team, over time.

Part 5 · The outcome you can trust

Reliability, as a property of the system

You don't fix the model. You build a system around it that makes reliability a feature, something that's there by design rather than just hoped for.

01
Completeness
Every provision, every time. It's the system, not the model, that guarantees nothing gets skipped.
02
Consistency
Same contract in, same review out, whether it runs at 9am or at midnight.
03
Hallucination suppression
Redundant passes filter the noise out before it reaches a person. It shouldn't really be a problem anymore.
04
Formatting protected
Deterministic tooling makes the edits, not the model. This is the hardest one to get right (more on that below).
The definition of "reliably"

"Reliably" isn't a smarter model. Everyone has the same models. It's the discipline of refusing to let the model decide what to skip.

Part 5 · The honest section

Where even this still struggles

There are two problems this architecture doesn't magically solve, and Martin was candid about both.

Judgment
Coverage and consistency can't manufacture a view where the playbook is silent. Novel provisions, adversarial drafting, real judgment calls: those go to a human. The system's job is to make sure they're the only things that ever reach one.
The format itself
A Word file isn't really text. Under the hood it's a zip of deeply nested XML (OOXML). One small malformation and Word breaks. The fix isn't a smarter model. It's not letting the model freely rewrite the document. And honestly, nobody has fully solved it yet.
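To see what "a zip of deeply nested XML" means in practice, the sketch below builds a stripped-down stand-in for a .docx in memory and reads the text back out with Python's standard library. The toy package contains only `word/document.xml`, so Word would not open it (a real file also needs content-type and relationship parts), and the sample sentence is invented. The namespace and the `w:p` / `w:r` / `w:t` element names are the ones WordprocessingML uses.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# One paragraph split across two runs, the second one bold.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>This agreement shall be governed by </w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>the laws of England</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_toy_package():
    """Build an in-memory zip holding only the main document part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def paragraph_texts(package_file):
    """Return the plain text of each paragraph by joining its text runs."""
    with zipfile.ZipFile(package_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


texts = paragraph_texts(make_toy_package())
```

Even this one sentence is split across two runs because part of it is bold. Reading the text is the easy direction. Writing an edit back means putting it in the right run without disturbing the markup around it, which is why that job is better left to deterministic code than to free-form model output.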
Plain English

The jargon, defined

Every term Martin used, in one place, so you can decode any vendor's pitch and ask the right questions.

LLM · large language model
A statistical model trained on enormous amounts of text that predicts the next word (token) one at a time. The "AI" underneath ChatGPT, Claude, and the rest.
Token
A sub-word chunk of text, and the unit a model actually reads and predicts. Breaking words into tokens turns out to be more efficient than using whole words.
Pre-training
The first, most expensive phase: reading internet-scale text and adjusting billions of weights to learn word probabilities. Produces a raw "base model."
Post-training
The second phase, which gives a model its personality and its skills: following instructions, declining harm, reasoning. It's where most recent capability jumps come from.
Hallucination
A confident, plausible-sounding answer that isn't true. A direct consequence of lossy compression: the model reconstructs facts from patterns rather than retrieving them.
RAG · retrieval-augmented generation
Giving the model the relevant document or passage as context before it answers, so it has the facts in front of it instead of guessing from memory.
Agent
An LLM equipped with tools and able to take multi-step actions: triggered by events, working toward an outcome, escalating to a human on the exceptions. A step beyond a chatbot.
Chain-of-thought reasoning
Forcing the model to think through a problem before answering, instead of blurting the first plausible token. The reason it can now solve the "reveal the murderer last" test.
Guardrails
Hard rules plus a separate monitoring system watching every exchange. It defines off-limits topics, escalation thresholds, and what may ship without a human.
Human in the loop
A person reviewing the outputs that warrant it (the exceptions and flagged risks) rather than every result. Supervision built into the system rather than bolted on afterwards.
Scaffolding
Everything built around the model: retrieval, tools, evaluation, orchestration, multi-agent flows. The fastest-improving driver, and the one that matters most for legal.
Distillation
Training a smaller model on the outputs of a larger one, so it becomes good and cheap at a narrow task, at the cost of being worse at everything else.
OOXML
The format inside a Word .docx, a zip of deeply nested XML. It's why redlining reliably without breaking the file is genuinely hard.
Orchestration
The system-level logic that directs the model, deciding which clauses get reviewed, how hard, and how many times, instead of letting the model wander.
If you remember five things

The takeaways

  1. An LLM is a next-word predictor, not a knowledge base. It optimises for plausibility rather than truth, so truth is something you have to engineer at the system layer.
  2. The four wrappers (context, tools, guardrails, and human in the loop) are about 90% of whether a legal AI product is any good. The model is the easy part.
  3. A general-purpose agent reviewing a contract is impressive in a demo and unreliable at scale. Taking the wheel back, with system-directed, exhaustive, redundant review, is what makes it dependable.
  4. A tool is not an outcome. Everyone rents the same model. The gap gets closed by orchestration, legal engineering, integration, and supervision, three quarters of which are built for your team over time.
  5. Supervision is becoming the critical role. The system handles what it's confident about and escalates the rest, the human's judgment feeds back in, and the whole thing keeps improving without retraining the model.
The four questions to take to any vendor

Martin and Lily kept coming back to four questions that separate a tool from an outcome:

01. Does it actually do the work, or just make a lawyer faster?

02. Does it run on your playbooks, or on generic training data?

03. Does it handle the full workflow, or just one step?

04. Who supervises the output?

Subscribe

The Intake

Weekly briefings on what's actually changing in legal AI: the market shifts, the regulatory moves, and the structural questions that matter for enterprise legal teams. We'll also send details of the next webinar.

Subscribe on Substack
Flank

Outsource legal work to supervised agents

Enterprise legal teams use Flank to handle high-volume contracting end to end: NDAs, MSA redlines, procurement, triage. Agents that know your templates, your terms, and your escalation rules.

Learn more at flank.ai