Under the Hood Recap: How agentic AI actually does contract work (reliably)

In one minute

The short version

Everyone now has access to the same powerful models. So the model itself is no longer what separates a good legal AI product from a bad one. What separates them is everything you build around the model.

Martin started from first principles: what a large language model actually is (a statistical next-word predictor, nothing magical), and why that design means it optimises for plausibility rather than truth. From there he covered the four things you wrap around it (context, tools, guardrails, and human supervision) that turn a clever autocomplete into something you can trust across thousands of contracts. His honest punchline: reliability isn't a smarter model. It's the discipline of refusing to let the model decide what to skip.

Read

This recap

The highlights, the diagrams in plain text, and a glossary of every term Martin used.

Watch

The recording

The full session on video, including the live Q&A at the end.

Download

The slide deck

The full presentation as a PDF. Share it with your team, or with IT and procurement.

The deck

Download the full presentation

Every slide from the session: the next-word example, the capability ladder, the four wrappers, and the redundant-review diagram. Handy for briefing your IT, procurement, or board.

Watch

The full session

Martin's 45-minute talk, plus the live Q&A with Martin and Lily at the end. If you'd rather skim, the written recap below covers all of it.

The frame

Adoption is everywhere. Impact isn't.

The defining problem for in-house legal in 2026 isn't whether to adopt AI. It's why adoption hasn't reduced outside-counsel spend.

87% of GCs report using AI, up from 44% last year

say it's actually reduced their outside-counsel spend

90% of whether a legal AI product is good comes down to the wrappers, not the model

The gap between those first two numbers is the whole talk. Everything below is about closing it.

Part 1 · How AI actually works

What a large language model actually is

A large language model isn't anything magical. It's a statistical model of language, and at runtime all it does is predict the next word, one at a time, in a loop. That really is the core trick. Everything else is engineering you build around it.

It learned to do this by reading an internet-scale pile of text: books, code, contracts, the web. The bigger and better that pile, the better the model gets. So it isn't looking an answer up anywhere. It's generating the most plausible continuation of whatever you gave it.
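
To make the "predict, append, repeat" loop concrete, here is a toy sketch in Python. The lookup table `NEXT_WORD` is made up for illustration and stands in for the billions of learned weights in a real model. A real model scores every token in its vocabulary rather than consulting a dictionary.

```python
# Toy stand-in for a trained model: maps the previous word to the most
# plausible next word. Hypothetical data, for illustration only.
NEXT_WORD = {
    "this": "agreement",
    "agreement": "shall",
    "shall": "be",
    "be": "governed",
    "governed": "by",
    "by": "the",
    "the": "laws",
    "laws": "<end>",
}


def generate(prompt: str, max_words: int = 20) -> str:
    """Predict one word, append it, and repeat until the model stops."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        next_word = NEXT_WORD.get(words[-1], "<end>")
        if next_word == "<end>":
            break
        words.append(next_word)  # the prediction becomes part of the input
    return " ".join(words)


print(generate("This agreement"))
# prints: this agreement shall be governed by the laws
```

Nothing in the loop looks anything up or checks whether the sentence is true. It only extends the text with whatever the table says is most plausible next.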

It's predicting one word at a time

"This agreement shall be governed by the laws of ▮"

England: 61%
New York: 23%
the State

The model doesn't pick one word. It produces a whole distribution of plausible next words and samples from it. With nothing else to go on, both "England" and "New York" are reasonable, so it's really just guessing. (And technically it predicts tokens, which are sub-word chunks, rather than whole words.)
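
A minimal sketch of that sampling step, using the two probabilities from the example. The `<other>` bucket holding the remaining 16% is an assumption made so the weights sum to 1, and the function names are illustrative rather than any library's API.

```python
import random

# Probabilities from the example above; the remaining 16% is lumped into
# a single "<other>" bucket (an assumption so the weights sum to 1).
NEXT_TOKEN_PROBS = {"England": 0.61, "New York": 0.23, "<other>": 0.16}


def sample_next_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def most_likely_token(probs: dict[str, float]) -> str:
    """The alternative to sampling: always take the top candidate."""
    return max(probs, key=probs.get)


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in NEXT_TOKEN_PROBS}
print(counts)
print(most_likely_token(NEXT_TOKEN_PROBS))
```

Across many draws "England" comes up most often, but "New York" still appears in a meaningful share of runs. That is the behaviour to keep in mind when the same prompt gives different answers on different days.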

How it learned, in two phases

01 · Pre-training

Read everything

It reads everything and adjusts billions of weights until its predictions match what humans actually wrote. This is the expensive part, hundreds of millions of dollars, which is why it's so hard to enter this business as a newcomer. What you get out is a "base model." It isn't learning facts here. It's learning word probabilities.

02 · Post-training

Teach it to be useful

This is where the model gets taught to be useful: follow instructions, decline harmful requests, take on a personality. Humans rate its answers and it gets nudged toward the good ones. It's also where reasoning and tool use get baked in, and where most of the recent capability jumps have come from, not from simply building bigger models.

The point worth remembering

The leaps you've seen lately come from that second phase, the training that gives a model its judgment and its skills, not from making the model bigger.

Three consequences you can't ignore

Because of the way it's trained, every model has three hard limits baked in. None of these are bugs. They're just properties of how the thing works.

⌛

No real-time knowledge

It was trained on a snapshot of the world. Ask it about yesterday and it either guesses or admits it doesn't know. Getting a model to update its own weights from the live world is one of the hardest open problems in the field.

🗜️

No database lookup

It doesn't retrieve facts, it reconstructs them from patterns. Think of petabytes of text compressed down into a much smaller model. That compression is lossy, and that loss is where hallucination comes from.

≈

No guarantee of truth

It optimises for plausibility, not correctness. People call it a "stochastic parrot": it's very good at sounding right, which is not the same as being right. In legal reasoning, that distinction is everything.

Martin's line

Truth is a problem you solve at the system layer, not at the model layer.

How you make it reliable: the four wrappers

A raw model is just the engine. These four wrappers are what turn it into something you can depend on, and they're where almost all of the real difference lives.

Wrapper 01

Context

Feed it the right documents while it works: your playbooks, your precedents, the clause in question. It's the easiest and biggest lever you have. The industry calls this RAG, or retrieval-augmented generation.
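
As a rough illustration of the retrieval step, the sketch below picks the playbook passage that shares the most words with the clause and puts it in front of the model. The playbook text, function names, and prompt wording are all hypothetical, and production systems typically rank passages with embeddings rather than word overlap.

```python
import re

# Hypothetical playbook snippets; a real system would load your own documents.
PLAYBOOK = [
    "Governing law: we accept England and Wales or New York.",
    "Indemnity cap: liability must be capped at 100% of annual fees.",
    "Payment terms: invoices are payable within 30 days.",
]


def words(text: str) -> set[str]:
    """Lower-cased alphabetic words in a piece of text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Rank passages by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(passages, key=lambda p: len(words(p) & query_words), reverse=True)
    return ranked[:k]


def build_prompt(clause: str, passages: list[str]) -> str:
    """Put the retrieved playbook position next to the clause under review."""
    context = "\n".join(f"- {p}" for p in retrieve(clause, passages))
    return (
        f"Playbook positions:\n{context}\n\n"
        f"Clause under review:\n{clause}\n\n"
        "Is the clause compliant with the playbook?"
    )


clause = "The Supplier's liability under the indemnity is capped at the annual fees."
print(build_prompt(clause, PLAYBOOK))
```

The model now answers with the indemnity position in front of it instead of reconstructing your policy from memory.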

Wrapper 02

Tools

Let it search the web, query systems, send mail, update records. The moment a model can take actions out in the world is the moment a chatbot turns into an agent.

Wrapper 03

Guardrails

Hard rules, plus a separate system watching every exchange. Together they define off-limits topics, escalation thresholds, and what's allowed to ship without a human ever seeing it.

Wrapper 04

Human in the loop

Send the right outputs to a person, the exceptions and the flagged risks, rather than every single one. The model does the work where it's confident and hands off where it isn't.
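
Wrappers 03 and 04 can be sketched together as a single routing function. The threshold, the off-limits topics, and the `ReviewOutput` fields below are assumptions for illustration. In particular, where the confidence score comes from is a design decision the talk did not specify.

```python
from dataclasses import dataclass

# Hypothetical policy values; the real ones are a decision for your team.
CONFIDENCE_THRESHOLD = 0.8
OFF_LIMITS_TOPICS = {"litigation strategy", "regulatory filing"}


@dataclass
class ReviewOutput:
    clause_id: str
    topic: str
    confidence: float  # assumed to be supplied by the surrounding system
    risk_flagged: bool


def route(output: ReviewOutput) -> str:
    """Return 'blocked', 'human', or 'auto' for a single review output."""
    if output.topic in OFF_LIMITS_TOPICS:
        return "blocked"  # hard rule: the agent never handles this topic
    if output.risk_flagged or output.confidence < CONFIDENCE_THRESHOLD:
        return "human"  # exceptions and flagged risks go to a person
    return "auto"  # confident, unflagged work ships without review


print(route(ReviewOutput("7.2", "indemnity", 0.95, risk_flagged=True)))
print(route(ReviewOutput("12.1", "notices", 0.93, risk_flagged=False)))
```

The rules live outside the model, so changing what escalates is a policy edit rather than a retraining exercise.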

The headline number

About 90% of whether a legal AI product is any good comes down to these four wrappers, not the underlying model.

The capability ladder

The same model can sit at very different levels of autonomy, depending on how much of the work it's trusted to drive and where the human sits.

Rung 01

Chatbot

15% autonomous

You drive every turn. You ask, it answers. ChatGPT in a browser. Useful, but the returns flatten pretty fast.

Rung 02

Copilot

50% autonomous

Embedded where lawyers already work. It suggests, you execute. Every contract still passes through a lawyer who has to open the document.

Rung 03

Agent

90%, human on exceptions

You give it an outcome and it does the multi-step work, triggered by something like an arriving email and watched the whole time by guardrails. A lawyer supervises the exceptions, and the work actually leaves the queue.

Part 2 · Why models keep improving

Four drivers, none of them close to exhausted

Releases are coming faster, not slower. Four levers explain why, and the one that matters most for legal is also the one improving fastest.

Driver 01

Scale

More data, more parameters, more compute. It still pays off, though the returns are starting to flatten. When they first did, people declared the end of AI progress. They were wrong, and the next driver is why.

Driver 02

Reasoning

Give the model room to think before it answers. This is "chain of thought," and it's where most of last year's gains came from. Martin's favourite example: ask an early model to write a detective story and only reveal the murderer in the last line, and it genuinely only works out who did it in the last line, because it never had a chance to think ahead.

Driver 03

Post-training

Teaching it to check its own reasoning, follow complex rules, and refuse cleanly. It's the most secretive part of the whole pipeline. Each lab guards it closely and very little gets published.

Driver 04

Scaffolding

Everything around the model: retrieval, tools, evaluation, and multi-agent flows where a big agent spawns smaller ones to do the work. This is improving fastest, and it's what matters most for legal.

The shift this creates

Everyone rents the same engines now. The question is no longer how good the model is. It's what you do with it.

Part 3 · A tool is not an outcome

Hand a contract to a general-purpose agent

Claude Cowork, the Claude plug-in for Word, even a tool tuned for legal: underneath, they're all the same shape. A powerful model, a handful of general tools (read, search, edit, comment), pointed at a document. And in a demo it genuinely works.

But once you actually look under the hood, the problems start to show:

What the general agent does

Burns tokens on everything around the document, not the review itself.
Reads non-systematically: it skims the front, jumps to the end, then hops between clauses.
No ledger of what it's already seen, so it can silently skip provisions.
Stochastic: run it three times and it can change its mind on whether a clause is compliant.

Why, and it's not a criticism

It's a general-purpose model. The same one writes code, designs slides, and reviews your contract.
It was never designed to do one thing exhaustively and identically every time.
The quirks that don't matter for a quick chat surface hard on complex, repeated legal work.
The fix isn't a better model. It's taking the wheel back at the system level.

Take the wheel back: system-directed review

It looks less impressive in a demo. But across hundreds of documents it's far more stable, and it gives the lawyer an outcome they can actually count on.

Decompose the contract

A .docx is broken into a small, structured representation of its clauses and sections. Nothing depends on the model deciding where to look.

The system decides what gets reviewed, not the model

Every clause, every section, no jumping around. Effort is deliberate: an indemnity cap is never brushed off, it's hard-coded to get proper attention.
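
The two steps above, decomposition and a system-owned review plan, might look roughly like this. The sketch assumes plain text with "1.", "2." clause numbering rather than a real .docx, and the `HIGH_STAKES` keyword list and effort labels are hypothetical.

```python
import re

# Hypothetical list of terms that always trigger the most thorough review.
HIGH_STAKES = ("indemnity", "liability", "termination")


def decompose(contract_text: str) -> list[dict]:
    """Split plain text into numbered clauses (assumes '1.', '2.' numbering)."""
    parts = re.split(r"(?m)^(\d+)\.\s+", contract_text.strip())
    # With one capture group, re.split yields [preamble, num, body, num, body, ...]
    return [
        {"number": int(parts[i]), "text": parts[i + 1].strip()}
        for i in range(1, len(parts) - 1, 2)
    ]


def plan_review(clauses: list[dict]) -> list[dict]:
    """The system, not the model, lists every clause and sets its effort."""
    ledger = []
    for clause in clauses:
        text = clause["text"].lower()
        effort = "deep" if any(term in text for term in HIGH_STAKES) else "standard"
        ledger.append({"number": clause["number"], "effort": effort, "reviewed": False})
    return ledger


contract = """1. Governing Law. This agreement is governed by the laws of England.
2. Indemnity. The Supplier indemnifies the Customer, capped at the fees paid.
3. Notices. Notices must be sent in writing."""

ledger = plan_review(decompose(contract))
for entry in ledger:
    entry["reviewed"] = True  # placeholder for the actual model call
assert all(entry["reviewed"] for entry in ledger)  # nothing was skipped
print(ledger)
```

The ledger is the point: completeness is checked by ordinary code at the end of the run, not inferred from what the model says it read.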

Redundant review: check each provision three times

Three independent passes over the same clause against the same playbook, then reconciled. Much like a senior lawyer reading a tricky clause twice.

What the three passes tell you

✓✓✓

All agree

Hallucinations wash out in aggregate. A made-up issue turns up in one pass but not the others, so the consensus is far more likely to be right.

⚑

They disagree

Often a sign of ambiguous wording in your playbook, not a model failure. A useful signal: fix the playbook, and flag this one for a human now.

↻

No switching

If the answer is stable across runs, it's far more likely to be factually correct. If it switches, escalate to a human to be safe.
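
The reconciliation described above can be sketched in a few lines. `review_once` is a hypothetical stand-in for one model call against the playbook, and the rule used here (accept only a unanimous verdict, escalate anything else) is one reasonable reading of the talk rather than a confirmed implementation.

```python
from collections import Counter
from typing import Callable


def redundant_review(clause: str, review_once: Callable[[str], str], passes: int = 3) -> dict:
    """Run the same review several times and reconcile the verdicts."""
    verdicts = [review_once(clause) for _ in range(passes)]
    top_verdict, votes = Counter(verdicts).most_common(1)[0]
    unanimous = votes == passes
    return {
        "verdicts": verdicts,
        "verdict": top_verdict if unanimous else None,
        "needs_human": not unanimous,  # any switching is escalated
    }


# Hypothetical stand-ins for a model call: one stable, one that changes its mind.
def stable_reviewer(clause: str) -> str:
    return "compliant"


flaky_answers = iter(["compliant", "non-compliant", "compliant"])


def flaky_reviewer(clause: str) -> str:
    return next(flaky_answers)


print(redundant_review("Clause 7: indemnity cap", stable_reviewer))
print(redundant_review("Clause 7: indemnity cap", flaky_reviewer))
```

The first call returns a verdict with `needs_human` set to `False`. The second returns no verdict and escalates, which is also the cue to check whether the playbook wording is ambiguous.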

Redundancy costs more to run. The cost is worth it: with redundant passes, a hallucination should rarely survive long enough to reach a lawyer.

Part 4 · Closing the gap

Orchestration is only one quarter of it

The clever review logic above is only one piece of it. There are three more, and these are the ones a competitor can't copy, because they get built for your team over time.

Orchestration

Exhaustive, deliberate, redundant review. The system decides what gets looked at and how hard, instead of letting the model wander.

Legal engineering

Your playbooks, your positions, and the implicit knowledge sitting in your senior lawyers' heads, drawn out and made usable by a machine. This is the part no competitor can copy, because it's yours.

Implementation & integration

Wired into your DMS, your CLM, and your inbox, the way work actually arrives and leaves. A review that lives in a tool nobody opens is not an outcome.

Monitoring & supervision

Every run is watched, drift gets caught early, and only the right exceptions go to a human. Their judgment then feeds back in, so the system keeps improving without anyone retraining the model.

The structural point

Orchestrating a rented model is the easy quarter. Everyone rents the same models. The other three get built, for your team, over time.

Part 5 · The outcome you can trust

Reliability, as a property of the system

You don't fix the model. You build a system around it that makes reliability a feature, something that's there by design rather than just hoped for.

Completeness

Every provision, every time. It's the system, not the model, that guarantees nothing gets skipped.

Consistency

Same contract in, same review out, whether it runs at 9am or at midnight.

Hallucination suppression

Redundant passes filter the noise out before it reaches a person. It shouldn't really be a problem anymore.

Formatting protected

Deterministic tooling makes the edits, not the model. This is the hardest one to get right (more on that below).

The definition of "reliably"

"Reliably" isn't a smarter model. Everyone has the same models. It's the discipline of refusing to let the model decide what to skip.

Part 5 · The honest section

Where even this still struggles

There are two problems this architecture doesn't magically solve, and Martin was candid about both.

Judgment

Coverage and consistency can't manufacture a view where the playbook is silent. Novel provisions, adversarial drafting, real judgment calls: those go to a human. The system's job is to make sure they're the only things that ever reach one.

The format itself

A Word file isn't really text. Under the hood it's a zip of deeply nested XML (OOXML). One small malformation and Word breaks. The fix isn't a smarter model. It's not letting the model freely rewrite the document. And honestly, nobody has fully solved it yet.
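
To show what "a zip of nested XML" means in practice, the sketch below builds a toy archive containing only `word/document.xml` and then edits a text node through an XML parser instead of rewriting the markup as a string. The archive is deliberately stripped down and would not open in Word, and `make_toy_archive` and `replace_text` are illustrative names, not part of any library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the familiar "w:" prefix on output

# A stripped-down document part. A real .docx also carries content types,
# relationships, styles and more, so this archive would not open in Word.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
    "<w:t>governed by the laws of England</w:t>"
    "</w:r></w:p></w:body></w:document>"
)


def make_toy_archive() -> bytes:
    """Zip the XML part the way a .docx stores its main document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def replace_text(docx_bytes: bytes, old: str, new: str) -> str:
    """Edit text nodes through an XML parser so the markup stays well-formed."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for node in root.iter(f"{{{W_NS}}}t"):
        if node.text and old in node.text:
            node.text = node.text.replace(old, new)
    return ET.tostring(root, encoding="unicode")


edited = replace_text(make_toy_archive(), "England", "New York")
ET.fromstring(edited)  # still parses: the edit did not break the XML
print(edited)
```

Even this is optimistic. Word often splits a single phrase across several runs, so a real redlining tool has far more structure to preserve than one text node.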

Plain English

The jargon, defined

Every term Martin used, in one place, so you can decode any vendor's pitch and ask the right questions.

LLM · large language model

A statistical model trained on enormous amounts of text that predicts the next word (token) one at a time. The "AI" underneath ChatGPT, Claude, and the rest.

Token

A sub-word chunk of text, and the unit a model actually reads and predicts. Breaking words into tokens turns out to be more efficient than using whole words.

Pre-training

The first, most expensive phase: reading internet-scale text and adjusting billions of weights to learn word probabilities. Produces a raw "base model."

Post-training

The second phase, which gives a model its personality and its skills: following instructions, declining harm, reasoning. It's where most recent capability jumps come from.

Hallucination

A confident, plausible-sounding answer that isn't true. A direct consequence of lossy compression: the model reconstructs facts from patterns rather than retrieving them.

RAG · retrieval-augmented generation

Giving the model the relevant document or passage as context before it answers, so it has the facts in front of it instead of guessing from memory.

Agent

An LLM equipped with tools and able to take multi-step actions: triggered by events, working toward an outcome, escalating to a human on the exceptions. A step beyond a chatbot.

Chain-of-thought reasoning

Forcing the model to think through a problem before answering, instead of blurting the first plausible token. The reason it can now solve the "reveal the murderer last" test.

Guardrails

Hard rules plus a separate monitoring system watching every exchange. Together they define off-limits topics, escalation thresholds, and what may ship without a human.

Human in the loop

A person reviewing the outputs that warrant it, the exceptions and flagged risks, rather than every result. Supervision built into the system rather than bolted on afterwards.

Scaffolding

Everything built around the model: retrieval, tools, evaluation, orchestration, multi-agent flows. The fastest-improving driver, and the one that matters most for legal.

Distillation

Training a smaller model on the outputs of a larger one to make it better and cheaper at a narrow task, at the cost of being worse at everything else.

OOXML

The format inside a Word .docx, a zip of deeply nested XML. It's why redlining reliably without breaking the file is genuinely hard.

Orchestration

The system-level logic that directs the model, deciding which clauses get reviewed, how hard, and how many times, instead of letting the model wander.

If you remember five things

The takeaways

An LLM is a next-word predictor, not a knowledge base. It optimises for plausibility rather than truth, so truth is something you have to engineer at the system layer.
The four wrappers (context, tools, guardrails, and human-in-the-loop) are about 90% of whether a legal AI product is any good. The model is the easy part.
A general-purpose agent reviewing a contract is impressive in a demo and unreliable at scale. Taking the wheel back, with system-directed, exhaustive, redundant review, is what makes it dependable.
A tool is not an outcome. Everyone rents the same model. The gap gets closed by orchestration, legal engineering, integration, and supervision, three quarters of which are built for your team over time.
Supervision is becoming the critical role. The system handles what it's confident about and escalates the rest, the human's judgment feeds back in, and the whole thing keeps improving without retraining the model.

The four questions to take to any vendor

Martin and Lily kept coming back to four questions that separate a tool from an outcome:

01. Does it actually do the work, or just make a lawyer faster?

02. Does it run on your playbooks, or on generic training data?

03. Does it handle the full workflow, or just one step?

04. Who supervises the output?

The Intake

Outsource legal work to supervised agents