You hear "AI" and "agents" all the time, and the target keeps moving. This is the technical foundation underneath — in plain English, no engineering degree required. What a language model actually is, why it hallucinates, what makes an agent an agent, and what it takes to make any of it reliable enough for legal work.
If you're non-technical and haven't had much exposure to statistics, large language models can seem cryptic and black-boxy. They aren't anything magical — they are just statistical models. Some input comes into the model, some calculations happen inside it, and an output comes out.
Statistics has been around a long time, and machine learning builds on it. For the past hundred years or so we've used statistical models to predict quantities: if you know how old a tree is, you can predict how tall it is. We also use them to classify things, like flagging an invoice as likely fraudulent, and machine learning has been doing that routinely for ten or fifteen years.
What's different about a large language model is that it's recursive (the technical term is autoregressive: each output is fed back in as part of the next input). You give it an input, say the opening of a clause, and it predicts the next token. Not even the next word, strictly: the labs figured out it's more efficient to break words into smaller chunks called tokens. And it doesn't predict just one token. It predicts a full probability distribution over plausible continuations.
With no other context, "England" and "New York" are both perfectly plausible. To predict what comes after "England", the model takes "England", appends it to the sentence, and makes another pass through the model. Then again, and again — one prediction at a time, in a recursive loop. You've seen this with your own eyes: in early versions of ChatGPT, the text was written actively onto your screen, word by word. That's still happening. It comes from the nature of the model.
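For readers who want to see the loop itself, here is a minimal sketch of "predict, append, repeat". Everything in it is invented for illustration: `TOY_MODEL` is a hand-written lookup table with made-up probabilities, and unlike a real language model it looks only at the last token rather than the whole sequence.

```python
import random

# Hypothetical toy "model": for each token, a made-up distribution over next tokens.
TOY_MODEL = {
    "governed": {"by": 1.0},
    "by": {"the": 1.0},
    "the": {"laws": 1.0},
    "laws": {"of": 1.0},
    "of": {"England": 0.6, "New York": 0.4},
    "England": {"and": 0.9, ".": 0.1},
    "and": {"Wales": 1.0},
    "Wales": {".": 1.0},
    "New York": {".": 1.0},
}


def next_token_distribution(tokens):
    """Return the distribution over the next token.

    Assumption: this toy conditions only on the last token; a real model
    conditions on the entire sequence so far.
    """
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, rng, max_new_tokens=10):
    """Sample one token, append it, and run the 'model' again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = next_token_distribution(tokens)
        choices, weights = zip(*distribution.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(token)  # the output becomes part of the next input
        if token == ".":
            break
    return tokens


print(" ".join(generate(["governed"], random.Random(0))))
```

Run it with different seeds and the sentence sometimes ends in England and sometimes in New York. Both are plausible to the table, and nothing in the loop checks which one is true for your contract.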
The models you're using today are trained on an internet-scale corpus — computer code, contracts, Wikipedia, books, the web. The more, the better. That is genuinely the entire core trick. Researchers studied this a decade ago as "scaling laws": the more data and the bigger the model, the more capable it becomes — and so far that has held fairly well.
One thing worth saying out loud: we know a lot about these models, and we also don't know much. We can train them, and we understand how they work mechanically. But at the size they've reached, it's very hard to disassemble one and answer a specific question like why did it pick England here, and not New York? That opacity comes from how they're trained — which is the next thing to understand.
Training happens in two stages, and the distinction matters more than almost anything else in this piece.
The leaps you've seen over the last couple of years have come mostly from the second phase — not from ever-bigger models. Worth holding onto, because it explains a lot of what follows.
Three things follow directly from how these models are built, and all three matter enormously for legal work.
There is no guarantee of truth at the model level. The model optimises for plausibility, not correctness. Truth is a problem you solve at the system layer — not the model layer. Especially in legal work, that distinction is everything.
So how do you make it reliable — whether you want to use these models efficiently in your own work, or you're assessing software built with one at its core? There are four levers, and they sit around the model, not inside it.
Everyone has access to the same models. The overwhelming share of whether a legal AI product is any good is these four wrappers — the context, tools, guardrails, and supervision built around the model — not the underlying model itself.
A useful way to place any product you're shown is a capability ladder with three rungs. The difference between them isn't intelligence — it's who initiates the work, who guides it, and where the human sits.
The move from rung two to rung three is a very different model of work — who initiates the conversation, who does the work, and where the human sits in the whole loop.
Models are being released faster than ever, and that's counting only the meaningful releases, the ones that actually move benchmarks. Four levers explain why.
Even if the models stopped improving tomorrow, we don't yet understand what we can get out of them at their current capability. Everyone has the same engines. The question is no longer how good the model is — it's what you do with it.
So let's run the experiment everyone is running right now. Take a powerful model, give it a handful of general tools — read the document, search it, edit, comment — and point it at a contract. Claude Cowork, the Claude plug-in for Word, the various Word add-ins: all the same shape underneath.
And it works. Genuinely. We've all seen the demos, and it does a relatively impressive job — especially in demos. Until you actually look under the hood at what the agent is doing with its effort.
What we observed, experimenting with many of these tools, is that the agent burns a lot of tokens not reviewing the document but doing work around it. It sees it has access to a document and tries to orient itself: it reads the front matter to understand what the document is, probably reads the last page to see where it ends and whether it's cut off, and then jumps around reviewing clauses and sections in no fixed order. That kind of messy process brings predictable problems:
None of this is a criticism of the models. It is a fact that they are general purpose. The model reviewing your third-party paper is the same model I use to write computer code, and the same model our designer used to build the webinar slides. They were not designed to do one thing — that's a feature. But the small quirks come to the top when you ask one to do relatively complex work, unconstrained. (A question we often get: is it the same Claude model for legal? Yes, it is. No lab ships a legal-specific frontier model — OpenAI tried a code-specific one with early Codex and let it go, because the really beefy general models did a better job.)
None of this is the model being weak. It's the model being unconstrained.
Contract review has a hard requirement: it should be exhaustive and consistent. Every provision. Every time. The same way. No jumping left and right — a list of things it needs to do, no questions asked. A self-directing generalist can't guarantee that. It can only try — and "usually" isn't the bar when the clause it skipped ends up in front of a regulator.
This is the point where it's worth introducing the distinction that many failing demos blur: capability versus outcome. We all have this capability at our fingertips now, and it is necessary. But capability is not what you need. You need the outcome.
The gap between those two is the actual work. Closing it takes four things — none of them the model: orchestration, legal engineering, integration, and supervision. The next two sections take each in turn.
Speaking from experience: you can take the wheel back from the model a little. It will look less impressive in a demo. It will be much more stable, and it gives the lawyer using it a guaranteed outcome across the hundreds of documents you'll actually be reviewing.
Instead of letting the agent be self-directed and jump around, the review becomes system-directed: the system decides that every single clause and every single section gets reviewed by the model. Start from the docx, decompose the contract into a very small, modular structure, and every piece is properly taken care of. Completeness stops being something you hope for and becomes something you enforce — omission isn't an option the model is allowed to take.
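A minimal sketch of what "system-directed" can look like in code. All names here (`Clause`, `review_clause`, `review_contract`) are hypothetical, `review_clause` is a stand-in for a real model call, and the sketch assumes the contract has already been decomposed into clauses with unique numbers. It is not a description of any particular product's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    number: str
    heading: str
    text: str


def review_clause(clause):
    """Hypothetical stand-in for a model call that reviews one clause."""
    return f"reviewed {clause.number}"


def review_contract(clauses, reviewer=review_clause):
    """The system, not the model, decides what gets reviewed: every clause."""
    results = {}
    for clause in clauses:
        results[clause.number] = reviewer(clause)

    # Completeness is enforced, not hoped for: a missing result is an error.
    missing = [c.number for c in clauses if results.get(c.number) is None]
    if missing:
        raise RuntimeError(f"unreviewed clauses: {missing}")
    return results


clauses = [
    Clause("1", "Definitions", "..."),
    Clause("2", "Indemnity", "..."),
    Clause("3", "Notices", "..."),
]
print(review_contract(clauses))
```

The design choice is that the loop lives outside the model. The model is only ever asked about one clause at a time, so it never gets the chance to decide that a clause isn't worth looking at.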
The system also dictates that effort is deliberate. You don't want an indemnity cap just brushed off; you want indemnity properly looked at, as one of the more important things the model spends effort on, while boilerplate like notices gets a lighter pass. Everything still gets covered — but effort is allocated, not improvised. That is hard-coded in the orchestration, not left to the model's mood.
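Allocating effort can be as plain as a lookup table that the orchestration consults before it calls the model. The sketch below is an assumption about how that might look: the category names, the effort levels, and `plan_effort` are all hypothetical.

```python
# Hypothetical effort policy, fixed in the orchestration rather than chosen by the model.
EFFORT_BY_CATEGORY = {
    "indemnity": "deep",   # high-stakes provisions get the most effort
    "notices": "light",    # boilerplate gets a lighter pass
}
DEFAULT_EFFORT = "standard"  # everything else is still covered


def plan_effort(categories):
    """Pair every clause category with an effort level; nothing is dropped."""
    return [(c, EFFORT_BY_CATEGORY.get(c, DEFAULT_EFFORT)) for c in categories]


print(plan_effort(["indemnity", "notices", "governing_law"]))
```

Note that the default is "standard", not "skip": the table changes how much effort a clause gets, never whether it is reviewed.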
For the switching and hallucination problems, the system-level answer is redundancy. Instead of letting the model take a clause and make one judgment — even with plenty of reasoning tokens — take one provision and review it three times, in parallel: three independent judgments on the same provision against the same playbook. It is more expensive, and the cost is really worth it, because each disagreement pattern tells you something different:
| What the three passes show | What it usually means | What the system does |
|---|---|---|
| All three agree | The decision is much more likely correct against the playbook. | It stands. Hallucinations are one-off by nature — they generally wash out in the aggregation. |
| They disagree | Very often not hallucination at all — it's ambiguous wording in your playbook. | That disagreement is a useful signal. It feeds a loop back into the playbook, so the next review doesn't carry the ambiguity. |
| They switch across runs | Genuine uncertainty. | Delegate to a human — just to be safe. |
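The routing in the table can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the reviewers are stubs standing in for independent model calls, the action names are invented, and "switching across runs" is modelled simply as a unanimous decision that differs from the decision recorded on a previous run.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def three_pass_review(provision, rule, reviewers, previous_decision=None):
    """Run independent reviewers in parallel and route the combined result."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        verdicts = list(pool.map(lambda review: review(provision, rule), reviewers))

    decision, votes = Counter(verdicts).most_common(1)[0]
    if votes < len(verdicts):
        # The passes disagree: often a sign the playbook wording is ambiguous.
        return {"action": "flag_playbook_ambiguity", "verdicts": verdicts}
    if previous_decision is not None and previous_decision != decision:
        # Unanimous now, but different from last run: treat as genuine uncertainty.
        return {"action": "escalate_to_human", "verdicts": verdicts}
    return {"action": "accept", "decision": decision, "verdicts": verdicts}


def always(verdict):
    """Hypothetical stub reviewer that always returns the same verdict."""
    return lambda provision, rule: verdict


outcome = three_pass_review(
    "Liability is capped at fees paid.",
    "Cap must be at least 12 months of fees.",
    [always("deviation"), always("deviation"), always("acceptable")],
)
print(outcome["action"])
```

With two passes saying "deviation" and one saying "acceptable", the sketch does not take a majority vote. It flags the disagreement, because per the table the disagreement itself is the useful signal.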
The model still does the thinking. The system decides what gets thought about, and how much.
That was just the orchestration. Three more pieces close the gap between capability and outcome.
The model is the easy quarter — everyone rents the same one. The other three are built, for your team, over time.
The point of all of this is not to modify the LLM. The technology is incredibly capable as it is. The point is to build a system around it that makes reliability a feature you don't only hope for — it is there by design:
Two places deserve candour. First, judgment. Coverage and consistency don't manufacture a view where the playbook is silent. Novel provisions, adversarial drafting, true judgment calls: those go to a human. The system's job is to bring that judgment in close: escalate, get the judgment out of the human, then inject it back into the system so it keeps looping and improving. It also has to make sure those are the only things that reach the human.
Second, the format itself. A Word file isn't text — under the hood it's OOXML, a zip of deeply nested XML. What general agents do is write find-and-replace instructions against that XML, and very often the match isn't perfect and it breaks the document. The Word plug-ins are getting better through tooling, but the failure modes around inserting fresh text persist. The fix isn't a smarter model — it's not letting the model freely rewrite the document. Nobody has fully solved this one yet, including us; it's where a lot of our engineering time goes.
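To make the format problem concrete, here is a small sketch using only Python's standard library. The XML fragment is hand-written for illustration and the archive is not a complete, valid .docx (a real file carries several more parts), but the shape is the real one: a zip containing `word/document.xml`, with visible text split across separate runs whenever formatting changes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative fragment: the reader sees "Liability Cap", but "Cap" is bold,
# so Word stores the phrase as two separate runs.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Liability </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>Cap</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

# Build a minimal zip container in memory (assumption: not a full .docx).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)

with zipfile.ZipFile(buffer) as archive:
    raw_xml = archive.read("word/document.xml").decode("utf-8")


def visible_text(xml):
    """Join the text of every w:t element, roughly what a reader sees."""
    root = ET.fromstring(xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))


print("Liability Cap" in raw_xml)                # False
print("Liability Cap" in visible_text(raw_xml))  # True
```

The phrase is on the page but not in the file as a contiguous string, which is why a naive find-and-replace against the XML can miss its target or damage the markup around it.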
"Reliably" isn't a smarter model — everyone has the same models. It's the discipline of refusing to let the model decide what to skip.
The point of understanding all of this isn't the trivia. It's that you can now ask vendors the right kind of questions — and properly assess the answers for yourself, whether you're buying, building, or making the case to your CFO.
One closing thought on that last question. Giving responsibility for an outcome to an autonomous system is the question behind every other question — and the answer is the same as the answer to what the future of the legal profession looks like. Knowledge work is moving towards supervision as a critical role: the human continually monitoring the agent, the agent trained to escalate anything high-risk or uncertain. Legal supervision may well become one of the most critical roles in the profession — and that's the direction of travel, not an afterthought bolted onto it.
Any tool — even a great one — gives you capability. An outcome takes orchestration, legal engineering, integration, supervision. Same model underneath — the gap between the two is the product.
This piece is adapted from Under the Hood: How agentic AI actually does contract work (reliably) — the live session presented by Martin Lukac, CTO at Flank, on 3 June 2026, with Q&A from Lily, CEO, and Jake, founder. Registrants receive the full recording.