Under the hood: how AI and agentic AI actually work

01 — What a large language model actually is

Not magic. A statistical model.

If you're non-technical and haven't had much exposure to statistics, large language models can seem cryptic and black-boxy. They aren't anything magical — they are just statistical models. Some input comes into the model, some calculations happen inside it, and an output comes out.

Statistics has been around a long time, and machine learning builds on it. For the past hundred years or so we've used statistical models to predict quantities — if you know how old a tree is, you can predict how tall it is — and to classify things, like flagging an invoice as likely fraudulent. We've been doing that for ten or fifteen years.

What's different about a large language model is that it's recursive. You give it an input — say, the opening of a clause — and it predicts the next token. Not even the next word, strictly: the labs figured out it's more efficient to break words into smaller chunks called tokens. And it doesn't predict just one token. It predicts a full distribution of plausible continuations.

Predicting one word at a time

"This agreement shall be governed by the laws of "

England

61%

New York

23%

the State

It doesn't look up an answer. It generates the most plausible continuation.

With no other context, "England" and "New York" are both perfectly plausible. To predict what comes after "England", the model takes "England", appends it to the sentence, and makes another pass through the model. Then again, and again — one prediction at a time, in a recursive loop. You've seen this with your own eyes: in early versions of ChatGPT, the text was written actively onto your screen, word by word. That's still happening. It comes from the nature of the model.

The models you're using today are trained on an internet-scale corpus — computer code, contracts, Wikipedia, books, the web. The more, the better. That is genuinely the entire core trick. Researchers studied this a decade ago as "scaling laws": the more data and the bigger the model, the more capable it becomes — and so far that has held fairly well.

One thing worth saying out loud: we know a lot about these models, and we also don't know much. We can train them, and we understand how they work mechanically. But at the size they've reached, it's very hard to disassemble one and answer a specific question like why did it pick England here, and not New York? That opacity comes from how they're trained — which is the next thing to understand.

02 — How it learned

Two phases of training

Training happens in two stages, and the distinction matters more than almost anything else in this piece.

Phase 1 — Pre-training

Read everything, predict the missing word

The internet-scale corpus is chunked into sentences, one word is masked, and the model is asked to predict it. Right prediction, positive reward; wrong prediction, negative reward. Repeat across the whole corpus, several times over. The result is called a base model. This is by far the most expensive part of the pipeline — you need hundreds of millions to do it efficiently, which is why it's almost prohibitive to train one from scratch and why so few labs do.

Phase 2 — Post-training

Where the model gets its flavour

A base model will happily tell you how to build a bomb — somewhere on the internet there's a recipe, and that's the most plausible continuation. Post-training is where models get their personality, their intention to follow instructions, their ability to decline harm. If you've ever felt that Claude has one personality and ChatGPT a slightly different one — that feeling comes from here. So do most of the recent capability jumps: reasoning, tool use, and the rest are ingrained at this stage.

The leaps you've seen over the last couple of years have come mostly from the second phase — not from ever-bigger models. Worth holding onto, because it explains a lot of what follows.

03 — Three consequences you can't ignore

What this way of training means in practice

Three things follow directly from how these models are built, and all three matter enormously for legal work.

Consequence 1

No real-time knowledge

The corpus is static. Whatever was true when the model was trained stays baked in, and the model does not update itself by observing the world. Building models that can modify their own weights is one of the biggest open frontiers — and it is very, very challenging.

Consequence 2

No database lookup

The model took petabytes of data — more than all our laptops combined could carry — and compressed it into something much smaller. That compression is lossy. The model can very plausibly reproduce patterns it has seen, but it is reconstructing, not retrieving.

Consequence 3

No guarantee of truth

Ask it something similar to its training data and it will produce a very plausible answer — which is not necessarily a truthful or factual answer. This lossy compression is one of the focal reasons these models hallucinate. People call them stochastic parrots for a reason.

The line to remember

There is no guarantee of truth at the model level. The model optimises for plausibility, not correctness. Truth is a problem you solve at the system layer — not the model layer. Especially in legal work, that distinction is everything.

04 — How you make it reliable

The four wrappers around the raw model

So how do you make it reliable — whether you want to use these models efficiently in your own work, or you're assessing software built with one at its core? There are four levers, and they sit around the model, not inside it.

01 — Context

If you're asking a factual question and you know there's a document somewhere containing the answer — maybe just take the document and put it into the prompt. However big or messy it is, the model is far more likely to answer factually. The industry calls this RAG: retrieval augmented generation. Pasting whole documents is inefficient, so engineering routines chunk big documents and feed the model the most plausible pieces at the right time. Your playbooks, your precedents, your clause library — this is how they reach the model.

02 — Tools

Instead of us trying to be smart about drip-feeding the right context at the right time, let the model figure it out. The first tool was web search: the model writes the Google query for you, retrieves the output, and answers much more truthfully. Today that has expanded almost infinitely — querying databases and calendars, sending emails, updating records. This is the transition from a chatbot, which answers, to an agent, which acts.

03 — Guardrails

Hard rules about what's off limits. Some are introduced through post-training; others run as a lateral system that watches every exchange between user and model and can intervene the moment it deems something dangerous or detects a lot of uncertainty. This is also what makes escalation thresholds possible — which clauses are off limits, what may ship without a human.

04 — Human in the loop

We are all — as societies and as businesses — still learning how humans and AI operate side by side. Where we are today: build systems where AI executes where it is confident, and for exceptions, flagged risks, or anything where it is uncertain, it clearly delegates to a human. The human reviews the right outputs — not every single one.

Why this matters when buying

Everyone has access to the same models. The overwhelming share of whether a legal AI product is any good is these four wrappers — the context, tools, guardrails, and supervision built around the model — not the underlying model itself.

05 — The capability ladder

Chatbot, copilot, agent

A useful way to place any product you're shown is a capability ladder with three rungs. The difference between them isn't intelligence — it's who initiates the work, who guides it, and where the human sits.

Chatbot

You ask a question, it answers. ChatGPT in a browser. We all tried it when it came out and it was very interesting — but a couple of years later you can see it was actually a very constrained experience. You can iterate and loop as many times as you want, but past a point the return on that lowers. You drive every turn.

~15% autonomous

Copilot

Embedded where you already work. Microsoft jumped at this early, putting models inside Word and Excel so the AI gets the context of the document and can do things in it. It proved much more challenging than anyone anticipated — and the initiation and guidance still come from the user, who opens the document and steers. It suggests; you execute.

~50% autonomous

Agent

Executes multi-step work, initiated by triggers from the world — an arriving email, a new request. It executes as much as it can, observed by guardrails. If there's no uncertainty or high risk detected, it produces an outcome. If a human is required, it escalates, receives the answer, injects it back, and finishes the work on its own.

~90% autonomous · human on exceptions

The move from rung two to rung three is a very different model of work — who initiates the conversation, who does the work, and where the human sits in the whole loop.

06 — Why models keep improving

Four drivers — none close to exhausted

Models are being released faster than ever — and counting only the meaningful releases, the ones that actually move benchmarks. Four levers explain why.

Scale

More data, more parameters, bigger models, more compute. It still pays off — but the returns are flattening. When that flattening first showed, people proclaimed the end of AI and called it all a bubble. Then something interesting happened.

Reasoning

Remember the recursive loop: one prediction at a time, with no chance to think about what the next token should be. Chain-of-thought reasoning forces the model to think first and answer second. My favourite example: ask an early LLM to write a detective novel and only reveal the murderer in the last sentence, and it would genuinely only figure out who the murderer was in the last sentence — it never had a chance to reason it through. With reasoning, it can. In legal work this matters directly: you don't want to force the model to answer straight away. You want to give it space to work through the logical reasons why an answer is the answer — and only then produce the result. This was one of the strongest levers behind the last year's gains.

Post-training

Not just safety — increasingly complex capabilities are built in at this stage. It is also the most secretive part of each lab's pipeline. Very little is published, so it's hard to say much in depth — which is itself worth knowing when a vendor waves the phrase around.

Scaffolding

Not the model — everything around it. Retrieval, tools, evaluation, detecting what's problematic and fixing it, and multi-agent flows, where one agent can spawn smaller agents to do work on its behalf. The compartmentalisation helps: a model that doesn't have to hold one very complex task is much more likely to do a better job. For legal, this is the driver that matters most — and it's improving fastest.

The strategic point

Even if the models stopped improving tomorrow, we don't yet understand what we can get out of them at their current capability. Everyone has the same engines. The question is no longer how good the model is — it's what you do with it.

07 — The experiment

Hand a contract to a general-purpose agent

So let's run the experiment everyone is running right now. Take a powerful model, give it a handful of general tools — read the document, search it, edit, comment — and point it at a contract. Claude Cowork, the Claude plug-in for Word, the various Word add-ins: all the same shape underneath.

And it works. Genuinely. We've all seen the demos, and it does a relatively impressive job — especially in demos. Until you actually look under the hood at what the agent is doing with its effort.

What we observed, experimenting with many of these tools, is that the agent burns a lot of tokens not reviewing the document but doing work around it. It sees it has access to a document and tries to orient itself: it reads the front matter to understand what the document is, probably reads the last page to see where it ends and whether it's cut off, and then jumps around reviewing clauses and sections in no fixed order. And by introducing that kind of messy process, predictable problems follow:

It can forget provisions — there is no ledger it keeps of "I've seen this, so I don't have to look at it again."
It reviews non-systematically — different documents get different routes through, different depths of attention.
It can switch — give it the same clause and the same playbook across different runs, and it can change its decision about whether the clause is compliant or non-compliant. That's the stochasticity in these models, and reasoning makes it more visible, not less.
It breaks formatting and tracked changes on the way through — more on why below.

None of this is a criticism of the models. It is a fact that they are general purpose. The model reviewing your third-party paper is the same model I use to write computer code, and the same model our designer used to build the webinar slides. They were not designed to do one thing — that's a feature. But the small quirks come to the top when you ask one to do relatively complex work, unconstrained. (A question we often get: is it the same Claude model for legal? Yes, it is. No lab ships a legal-specific frontier model — OpenAI tried a code-specific one with early Codex and let it go, because the really beefy general models did a better job.)

The diagnosis

None of this is the model being weak. It's the model being unconstrained.

08 — Capability vs outcome

A tool is not an outcome

Contract review has a hard requirement: it should be exhaustive and consistent. Every provision. Every time. The same way. No jumping left and right — a list of things it needs to do, no questions asked. A self-directing generalist can't guarantee that. It can only try — and "usually" isn't the bar when the clause it skipped ends up in front of a regulator.

This is the point where it's worth introducing the distinction that many failing demos blur. On one hand, we all have this capability at our fingertips now. It is necessary. But capability is not what you need.

What a tool gives you — capability

A capable model and a handful of general actions. Even one tuned for legal. Necessary — and not the thing you need. Everyone has access to the same models, so nobody is competing there. That's table stakes.

What you actually need — an outcome

Every contract reviewed completely and consistently, inside your workflow, in a form you can stand behind. Objections handled. A properly drafted document you can send to the counterparty without stress or trepidation.

The gap between those two is the actual work. Closing it takes four things — none of them the model: orchestration, legal engineering, integration, and supervision. The next two sections take each in turn.

09 — Orchestration

Take the wheel back from the model

Speaking from experience: you can take the wheel back from the model a little. It will look less impressive in a demo. It will be much more stable — and a guaranteed outcome for the lawyer using it — across the hundreds of documents you'll actually be reviewing.

Instead of letting the agent be self-directed and jump around, the review becomes system-directed: the system decides that every single clause and every single section gets reviewed by the model. Start from the docx, decompose the contract into a very small, modular structure, and every piece is properly taken care of. Completeness stops being something you hope for and becomes something you enforce — omission isn't an option the model is allowed to take.

The system also dictates that effort is deliberate. You don't want an indemnity cap just brushed off; you want indemnity properly looked at, as one of the more important things the model spends effort on, while boilerplate like notices gets a lighter pass. Everything still gets covered — but effort is allocated, not improvised. That is hard-coded in the orchestration, not left to the model's mood.

Redundancy: review each provision three times

For the switching and hallucination problems, the system-level answer is redundancy. Instead of letting the model take a clause and make one judgment — even with plenty of reasoning tokens — take one provision and review it three times, in parallel: three independent judgments on the same provision against the same playbook. It is more expensive, and the cost is really worth it, because each disagreement pattern tells you something different:

What the three passes show	What it usually means	What the system does
All three agree	The decision is much more likely correct against the playbook.	It stands. Hallucinations are one-off by nature — they generally wash out in the aggregation.
They disagree	Very often not hallucination at all — it's ambiguous wording in your playbook.	That disagreement is a useful signal. It feeds a loop back into the playbook, so the next review doesn't carry the ambiguity.
They switch across runs	Genuine uncertainty.	Delegate to a human — just to be safe.

The division of labour

The model still does the thinking. The system decides what gets thought about, and how much.

10 — The other three quarters

Orchestration is one quarter of it

That was just the orchestration. Three more pieces close the gap between capability and outcome.

Legal engineering

One of the most crucial inputs into these pipelines is your playbooks — and in their current form and shape, in most companies, they are not the easiest things for a large language model to read and work with. Legal engineering moulds them into a format, wording, and structure the model can work with better. Your playbooks, your positions, the implicit knowledge in your senior lawyers' heads, drawn out and made machine-usable — the part no competitor can copy, because it's yours.

Implementation & integration

The scaffolding needs to reach the places work actually arrives and leaves — your DMS, CLM, inbox, Outlook, ServiceNow — so the model can get the right context and execute on your behalf where it's safe to do so. A review that lives in a tool nobody opens is not an outcome.

Monitoring & supervision

Often overlooked, and probably one of the biggest levers in legal tech. You want to consistently monitor your agent's executions, see its traces, and see where it surfaces uncertainty. And you want to use the human's judgment, every time there's a human in the loop, as a signal to refine the system itself. The system isn't learning in the sense of modifying the underlying model — nobody is training new models here — but you are extracting signals and injecting them into future reviews so the model doesn't make the same mistake again.

The model is the easy quarter — everyone rents the same one. The other three are built, for your team, over time.

11 — The outcome you can trust

Reliability, as a property of the system

The point of all of this is not to modify the LLM. The technology is incredibly capable as it is. The point is to build a system around it that makes reliability a feature you don't only hope for — it is there by design:

Completeness

Everything is reviewed, every single time.

Consistency

The same contract in, the same review out. Same clauses, same calls, 9am or midnight.

Hallucination suppression

Thanks to redundancy, hallucinations shouldn't be a problem in this day and age anymore.

Formatting protected

A lot of legal work happens in Word, which is an inherently messy format, and formatting matters: for the optics, and for the branding of the companies whose paper it is. If the system inserts a redline, it should be nicely formatted and it shouldn't mess up the rest of the document. Again — that's not the model. That's tooling built around the model, with deterministic code making the edits.

Where even this struggles — the honest section

Two places deserve candour. First, judgment. Coverage and consistency don't manufacture a view where the playbook is silent. Novel provisions, adversarial drafting, true judgment calls — those go to a human. The system's job is to bring that judgment in close: escalate, get the judgment out of the human, then inject it back into the system so it keeps looping and improving — and to make sure those are the only things that reach them.

Second, the format itself. A Word file isn't text — under the hood it's OOXML, a zip of deeply nested XML. What general agents do is write find-and-replace instructions against that XML, and very often the match isn't perfect and it breaks the document. The Word plug-ins are getting better through tooling, but the failure modes around inserting fresh text persist. The fix isn't a smarter model — it's not letting the model freely rewrite the document. Nobody has fully solved this one yet, including us; it's where a lot of our engineering time goes.

What "reliably" means

"Reliably" isn't a smarter model — everyone has the same models. It's the discipline of refusing to let the model decide what to skip.

12 — What you can now ask

The questions this lets you put to any vendor

The point of understanding all of this isn't the trivia. It's that you can now ask vendors the right kind of questions — and properly assess the answers for yourself, whether you're buying, building, or making the case to your CFO.

Is the review self-directed or system-directed?

Does the model decide what to look at, or does the system guarantee every provision is reviewed, in order, every time?

How do you handle hallucination and switching?

Listen for redundancy — multiple independent passes, aggregated — not "the model is very accurate."

What happens when the passes disagree, or the playbook is silent?

The right answer involves a human, an escalation path, and a feedback loop into the playbook — not a confident guess.

How do you edit Word documents?

If the model freely rewrites the file, your formatting and tracked changes are at risk. Look for deterministic tooling around the model.

Where does it sit?

An enterprise outcome means the agent picks up work where it arrives — inbox, CLM, ServiceNow — not a tool each lawyer must remember to open and use consistently.

Who supervises, and what happens to their judgment?

Human checks should be a component of the system, not an afterthought — with the supervisor's calls injected back so the same mistake isn't made twice.

One closing thought on that last question. Giving responsibility for an outcome to an autonomous system is the question behind every other question — and the answer is the same as the answer to what the future of the legal profession looks like. Knowledge work is moving towards supervision as a critical role: the human continually monitoring the agent, the agent trained to escalate anything high-risk or uncertain. Legal supervision may well become one of the most critical roles in the profession — and that's the direction of travel, not an afterthought bolted onto it.

The whole talk in three sentences

Any tool — even a great one — gives you capability. An outcome takes orchestration, legal engineering, integration, supervision. Same model underneath — the gap between the two is the product.

This piece is adapted from Under the Hood: How agentic AI actually does contract work (reliably) — the live session presented by Martin Lukac, CTO at Flank, on 3 June 2026, with Q&A from Lily, CEO, and Jake, founder. Registrants receive the full recording.

How AI and agentic AI actually work