The supervision experience

01 — The premise

What supervision actually means

"Human-in-the-loop" has become a marketing phrase. What we mean by supervision is narrower and more specific: a lawyer or legal ops professional retains final authority over every piece of work the agent produces, but only intervenes where the agent is uncertain, the matter is novel, or the policy says so.

The word does a lot of work in this market, so it's worth being precise. There are at least three things people mean when they say "supervised AI" — and only one of them is what Flank is.

Not this

Pass-through

The AI drafts, a human clicks "approve" without reading. Supervision in name only. This is how most GenAI gets deployed, and it's how most hallucinations reach production.

Not this either

Line-by-line review

A lawyer reads every word the AI produces before it goes out. Safe, but takes almost as long as doing the work. This is where most AI contract tools leave you — and why pilots stall.

This

Exception-based supervision

The agent resolves what it's confident about, escalates what it isn't, and the supervisor focuses attention on the exceptions. The bulk of the work passes through without a human touching it — by design, and only after the system has earned that trust.

A fair concern

"If a supervisor is reading every redline before it goes out, how is this different from me doing the work myself?" Short answer: in week one, it isn't very different. The point is what happens in weeks four, twelve, and twenty-four — and whether the system is designed to get you there.

02 — The work

The four things a supervisor does

In a mature Flank deployment, a supervisor's job collapses to four discrete activities. Everything else the agent handles itself.

The Flank supervision dashboard in its full-browser view. A sidebar lists Agents, Tasks, Activity, and Supervision under a Flank.ai workspace. The middle column is a unified inbox of items from Email, MS Teams, the Flank app, Salesforce, and Helpdesk — GDPR queries, NDA review requests, sales-team questions, contract drafts. The right-hand columns show the conversation with William Smith about an NDA and a Review side panel with the tabs Non-compliant (28), Compliant (4), and Flank Flags (2); four clauses are listed with checkboxes under 'Awaiting Approval,' and a primary action at the bottom reads 'Approve selected changes.'

The full product in one frame. Work arrives from every channel the business already uses (email, MS Teams, Salesforce, Helpdesk, the Flank app) into a single supervisor queue. The agent picks up each item and drafts a response; the Review side panel groups its redlines into three categories (non-compliant with the playbook, compliant, and proactive Flank flags to the counterparty) and lets the supervisor approve them as a batch, approve them clause by clause, or dismiss them.

Approve exceptions

The agent flags a clause where its confidence is below threshold, or where the counterparty's position crosses a policy line. The supervisor reviews the flagged item — not the whole document — and either approves, edits, or escalates.

Resolve novelty

A new contract type, a new jurisdiction, a counterparty pattern the agent hasn't seen before. The supervisor makes the call, and the decision is captured as playbook data for next time.

Tune the playbook

Every approval, override, and escalation is a signal. The supervisor reviews patterns weekly — what's getting escalated that shouldn't, what's passing through that should be caught — and updates fallback positions, risk thresholds, and escalation rules.

Handle the hard stuff

Strategic negotiations, bet-the-company clauses, internal stakeholder disputes. The agent is explicitly instructed to route these to a human — and to stay out of the way until the human is ready to hand back.

Notice what's not on this list: drafting, reviewing full documents, chasing the business for information, logging outcomes, following up on signatures. The agent does all of that. A supervisor's attention is the scarcest resource in the system, and the system is built to spend it only where it matters.

03 — The curve

Week by week: from day one to mature deployment

The most honest thing we can say about supervision is that it is heavier at the start than it will be later. Prospects who expect day-one autonomy are disappointed. Prospects who brace for a permanent line-by-line review burden are surprised — usually by week four, sometimes earlier.

The supervision workload curve

% of agent outputs a supervisor reviews end-to-end, by deployment week

Indicative curve based on typical mid-market NDA / MSA flows across Flank deployments. Higher-stakes commercial matters and regulated workflows converge more slowly. Y-axis is the share of outputs reviewed end-to-end by a supervisor — exception flags, playbook tuning, and strategic escalations are handled separately and sit outside this measure.

Week 1 — Calibration

Nearly every output is reviewed. The supervisor is teaching the system what "good" looks like for this team's templates, tolerances, and edge cases. Expect to spend roughly the same time you would have spent doing the work yourself. This is the price of admission, not the steady state.

Weeks 2–3 — Narrowing

Confidence scores start clustering higher on known contract types. The supervisor stops reviewing full documents and starts reviewing flagged sections. Review time per contract drops by roughly half. The agent begins handling recurring fallback positions without escalation.

Weeks 4–8 — Exception-based

The shape of the work changes. Most contracts pass through without a supervisor touching them. The supervisor's queue becomes a short list of genuine exceptions — novel counterparty positions, edge cases, internal escalations. Time per contract drops to minutes, not hours.

Month 3+ — Steady state

The supervisor is no longer a bottleneck in the workflow. Their role shifts from reviewing individual outputs to tuning the system — updating playbooks, adjusting thresholds, reviewing weekly metrics. A team that supervised twenty contracts a week is now supervising a handful of exceptions and spending the rest of their time on work that requires judgement.

~90% of outputs reviewed in week 1
~40% reviewed by week 3
~10% reviewed by month 2
<5% by month 3+

Figures are typical for a mid-market NDA / MSA flow across Flank customers.

04 — The mechanism

Risk-based approval and confidence scoring

The curve in the previous section is not the result of the model getting smarter week over week. It's the result of a workflow that grades every output on two axes — confidence and risk — and uses the grade to decide what reaches a human.

Axis 1

Confidence

How sure is the agent about the output? Derived from the model's own signals, semantic agreement across redundant generations, and retrieval match strength against your playbook. Expressed as a continuous score, not a binary.
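As a rough illustration of how several signals could be folded into one continuous score, here is a minimal sketch. Everything in it is an assumption made for the example: the weighted average, the weights themselves, and the names `ConfidenceSignals` and `combine_confidence`. The text above names the inputs but does not say how Flank combines them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceSignals:
    """Three illustrative signals, each normalised to the range 0.0-1.0."""
    model_signal: float          # the model's own certainty signal
    generation_agreement: float  # semantic agreement across redundant generations
    retrieval_match: float       # match strength against the playbook


# Hypothetical weights (an assumption for this sketch, not real product values).
WEIGHTS = (0.3, 0.4, 0.3)


def combine_confidence(signals: ConfidenceSignals) -> float:
    """Return a weighted average of the signals as one continuous score."""
    values = (
        signals.model_signal,
        signals.generation_agreement,
        signals.retrieval_match,
    )
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each signal must lie between 0.0 and 1.0")
    return sum(weight * value for weight, value in zip(WEIGHTS, values))


if __name__ == "__main__":
    score = combine_confidence(ConfidenceSignals(0.9, 0.95, 0.8))
    print(f"confidence: {score:.2f}")
```

The point of the sketch is the shape of the output: a number on a scale rather than a yes/no, so that a threshold can be moved later without changing how the score is produced.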

Axis 2

Risk

How much does it matter if this output is wrong? Set by the customer at policy level — clause type, contract value, counterparty tier, jurisdiction. The agent inherits this from the playbook, not from its own judgement.
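A policy-level risk lookup of this kind might look like the sketch below. The clause names, the value threshold, the counterparty label, and the jurisdiction list are all hypothetical placeholders for whatever a customer writes into its own playbook; only the four inputs (clause type, contract value, counterparty tier, jurisdiction) come from the description above.

```python
from dataclasses import dataclass
from enum import IntEnum


class RiskTier(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class ClauseContext:
    clause_type: str
    contract_value: int       # in the customer's currency
    counterparty_tier: str
    jurisdiction: str


# Hypothetical customer policy: every entry below is an illustrative assumption.
CLAUSE_TIERS = {
    "governing_law": RiskTier.LOW,
    "confidentiality_term": RiskTier.MEDIUM,
    "limitation_of_liability": RiskTier.HIGH,
    "indemnity": RiskTier.CRITICAL,
}
HIGH_VALUE_THRESHOLD = 1_000_000
STRATEGIC_COUNTERPARTY = "strategic"
FAMILIAR_JURISDICTIONS = {"England and Wales", "New York"}


def risk_tier(ctx: ClauseContext) -> RiskTier:
    """Look the tier up from policy, keeping the strictest rule that applies."""
    # Unknown clause types default to HIGH so that novelty reaches a human.
    tier = CLAUSE_TIERS.get(ctx.clause_type, RiskTier.HIGH)
    if ctx.contract_value >= HIGH_VALUE_THRESHOLD:
        tier = max(tier, RiskTier.HIGH)
    if ctx.counterparty_tier == STRATEGIC_COUNTERPARTY:
        tier = max(tier, RiskTier.HIGH)
    if ctx.jurisdiction not in FAMILIAR_JURISDICTIONS:
        tier = max(tier, RiskTier.MEDIUM)
    return tier
```

Note that nothing in `risk_tier` consults the agent's output or its confidence. The tier is a pure function of policy and context, which is the property the paragraph above describes.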

Confidence alone isn't enough. A high-confidence output on a bet-the-company indemnity still warrants a human look; a medium-confidence output on a standard governing-law clause does not. The two axes combine into a risk-based approval matrix that decides, per output, whether it auto-approves, escalates to the agent's own second pass, flags for supervisor review, or goes straight to a named escalation.

In the product, every flagged clause carries a visible reason. The agent surfaces the playbook position it's comparing against ("The playbook maximum is 12 months"), drafts the proposed redline with a diff view (strikethrough in red, replacement in green), and reduces the human decision to a single dismiss / confirm choice — or, where confidence is high enough to justify it, handles the clause silently and reports it in the auto-confirmed lane.

The Flank Review side panel on a mutual NDA. Tabs at the top show Needs review (1), Auto-confirmed (9), and Flank flags (0). The single item needing review is Non-compete Clause 7, titled 'Excessive Non-Compete Scope and Duration,' with an explanation that the clause prohibits competing in any business line for 24 months while the playbook maximum is 12 months limited to directly competing activities in the same market segment. A Suggested change section shows a redline: 'twenty-four (24)' struck through and replaced with 'twelve (12) months,' and 'directly or indirectly engage in any business line that competes with the other party' struck through and replaced with 'directly engage in activities that directly compete with the other party in the same market segment as the Purpose defined herein.' Comments count is 1. Dismiss and Confirm buttons sit at the bottom.

The confidence + risk matrix made visible at the clause level. On this mutual NDA the agent handled nine clauses on its own and surfaced one — an over-wide non-compete that violates the playbook on both duration and scope — with the reasoning spelled out and the suggested redline already drafted.
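For readers curious how a word-level redline like the one in the screenshot can be produced, here is a minimal sketch using Python's standard `difflib` module. It is not a description of Flank's diff engine. The function name `redline`, the sample clause text, and the plain-text markers (`[-...-]` for struck-through text, `{+...+}` for replacement text) are assumptions standing in for the red and green rendering.

```python
from difflib import SequenceMatcher


def redline(original: str, revised: str) -> str:
    """Return a word-level redline of `revised` against `original`."""
    old_words, new_words = original.split(), revised.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(" ".join(old_words[i1:i2]))
            continue
        if i2 > i1:  # words removed from the original
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:  # words added in the revision
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(parts)


if __name__ == "__main__":
    print(redline(
        "for a period of twenty-four (24) months",
        "for a period of twelve (12) months",
    ))
    # prints: for a period of [-twenty-four (24)-] {+twelve (12)+} months
```

Diffing on words rather than characters keeps each change readable as a legal edit, which is what lets the supervisor's decision shrink to a single dismiss or confirm.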

The ratio that matters

The Review panel reads "Needs review 1 · Auto-confirmed 9 · Flank flags 0." Nine clauses out of ten passed through without a supervisor touching them. The one that didn't arrives pre-diagnosed and pre-drafted — all the human has to do is look, think, and click. That ratio, and the shape of what reaches the queue, is the whole product.

How the thresholds are set

The thresholds that govern "what gets escalated" are not set by Flank. They're set by the customer during sprint zero and tuned across the first month of deployment. A conservative team can dial up escalation on day one, then relax it as trust builds. A more aggressive team can run with lower thresholds from the start. Either way, the supervisor is in control of the tap.

Risk tier	Typical threshold	Who sees it
Low — standard clauses	Auto-approve if confidence > 90%	No human unless flagged
Medium — fallback positions	Auto-approve if confidence > 95% and inside playbook	Legal ops spot-check
High — commercial terms	Always flag for review	Supervising lawyer
Critical — custom / bet-the-company	Never auto-handled	Named escalation
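The table translates fairly directly into a routing function. The sketch below is illustrative only. The thresholds are the "typical" values from the table, and the function and enum names are hypothetical. One behaviour is an assumption the table does not spell out: a low- or medium-tier output that misses its threshold is sent to supervisor review.

```python
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto-approve"
    SUPERVISOR_REVIEW = "flag for supervisor review"
    NAMED_ESCALATION = "named escalation"


# Typical thresholds from the table; a customer would tune these per deployment.
LOW_THRESHOLD = 0.90
MEDIUM_THRESHOLD = 0.95


def route_output(tier: str, confidence: float, inside_playbook: bool) -> Route:
    """Decide where one agent output goes, given its risk tier and confidence."""
    if tier == "critical":
        return Route.NAMED_ESCALATION      # never auto-handled
    if tier == "high":
        return Route.SUPERVISOR_REVIEW     # always flagged, whatever the score
    if tier == "medium":
        if confidence > MEDIUM_THRESHOLD and inside_playbook:
            return Route.AUTO_APPROVE
        return Route.SUPERVISOR_REVIEW     # assumption: misses go to review
    if tier == "low":
        if confidence > LOW_THRESHOLD:
            return Route.AUTO_APPROVE
        return Route.SUPERVISOR_REVIEW     # assumption: misses go to review
    raise ValueError(f"unknown risk tier: {tier!r}")


if __name__ == "__main__":
    print(route_output("low", 0.93, inside_playbook=True).value)
    print(route_output("high", 0.99, inside_playbook=True).value)
```

A conservative team would raise `LOW_THRESHOLD` and `MEDIUM_THRESHOLD` at the start and lower them as trust builds; the routing logic itself does not change.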

05 — The distinction

Line-by-line review is a symptom, not the job

One of the most common things we hear from teams mid-pilot is some version of: "my supervisors are doing line-by-line review that takes almost as long as doing the work themselves." Almost always, this is a signal that the setup is still in its first three weeks — and that the playbook, thresholds, or escalation rules need tuning, not that supervision itself is broken.

Day one

Line-by-line review

The supervisor reviews every clause because the system hasn't yet earned the right to pass anything through without a check. Workload is roughly equivalent to doing the work yourself. This is the bootstrapping period — necessary, finite, and the only way to get to the next stage.

Mature deployment

Exception-based supervision

The supervisor reviews only what the system can't confidently resolve, or what policy requires a human to see. Most contracts pass through untouched. The supervisor's time is spent on genuine judgement calls, not a rubber-stamp exercise on standard work.

The Review side panel in use. The supervisor works through the "Needs review" queue on a real MSA — approving the proposed redlines where the playbook supports the change, editing where it doesn't, dismissing false positives. Auto-confirmed clauses and proactive Flank flags sit in their own lanes alongside. By the end of the clip, the queue has emptied and the contract is ready to send.

The discomfort in the first three weeks is real, and we don't hide it from prospects. The question any evaluator should ask isn't "how much supervision will I have to do?" — it's "is the system designed to reduce the supervision workload over time, and is there evidence it actually does?" The answer to both should be yes, with data to back it up.

The honest framing

If your supervision workload isn't dropping meaningfully between week one and week four, something is wrong with the setup, not with the concept: the playbook is too thin, the thresholds are mistuned, or the work doesn't yet fit the agent. In every case, the fix is visible in the data, which is why we run a weekly review across the first eight weeks of every deployment.

06 — The roles

Who supervises what

Supervision is not a single job. In most deployments, the work splits across three roles that already exist in the legal team — lawyers, legal ops, and legal admin. The agent routes each class of exception to the role best placed to handle it.

Role	Supervises	Typical time commitment
Lawyer	Commercial terms, legal exceptions, novel counterparty positions, escalations that touch risk or strategy.	~30 min/day at steady state. Concentrated on high-value judgement calls, not volume.
Legal ops	Playbook maintenance, threshold tuning, weekly metrics review, exception patterns, integration health.	~2–3 hours/week. Treats the agent like any other operational system under their remit.
Legal admin / paralegal	Low-risk procedural checks, counterparty data validation, signature routing, internal stakeholder follow-ups the agent flags for a human touch.	~1 hour/day. Much of this work simply wasn't happening reliably before.

The important point: no role is doing work beneath its seniority. The lawyer is not reading NDAs line by line. The legal ops lead is not chasing counterparty signatures. The admin is not making commercial judgement calls. This separation is the whole point, and it's what collapses when teams try to wrap a base model themselves and end up with one person approving everything the model produces.

07 — The standard

What "supervised to near-zero error" looks like

Accuracy metrics in isolation mislead. A 93% accurate agent sounds worrying; a 93% accurate agent behind a well-tuned supervision layer is, in production, closer to 99.5% — because the 7% the agent gets wrong is disproportionately the 7% it's least confident about, which is disproportionately what gets escalated to a human.
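The arithmetic behind that claim can be sketched in a few lines. The model is a deliberate simplification and an assumption of this example: every agent error that supervision catches is corrected, and supervision never breaks a correct output. Under that assumption, moving from 93% to 99.5% means supervision has to catch roughly 93% of the agent's errors. The function names are hypothetical.

```python
def system_accuracy(agent_accuracy: float, catch_rate: float) -> float:
    """Share of outputs that leave the system correct.

    catch_rate is the share of agent errors that supervision catches and
    fixes (a simplifying assumption: a caught error is always corrected).
    """
    residual_error = (1.0 - agent_accuracy) * (1.0 - catch_rate)
    return 1.0 - residual_error


def required_catch_rate(agent_accuracy: float, target_accuracy: float) -> float:
    """Share of agent errors supervision must catch to reach the target."""
    agent_error = 1.0 - agent_accuracy
    if agent_error <= 0.0:
        return 0.0  # a perfect agent needs no catching
    return max(0.0, 1.0 - (1.0 - target_accuracy) / agent_error)


if __name__ == "__main__":
    needed = required_catch_rate(0.93, 0.995)
    print(f"errors supervision must catch: {needed:.1%}")
    # prints: errors supervision must catch: 92.9%
```

This is why it matters where the errors sit. If they are concentrated in low-confidence outputs that get escalated anyway, a catch rate that high is reachable without anyone reviewing every output.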

The right framing

Supervision is not a tax you pay on agent inaccuracy. It's the mechanism that converts model accuracy into system reliability. The raw model output is the floor. The supervised output is the ceiling. The gap between the two is where the product actually lives.

What we track, and what we share with customers

Every supervised action is logged against the counterfactual — would a human reviewer have caught what the agent caught, would they have flagged what the agent flagged, would they have drafted the redline the agent drafted. Every week of a deployment produces a supervision quality report covering:

Agent-side metrics

Confidence calibration — does 90% confidence actually mean 90% correct?
Escalation rate by contract type and risk tier
Playbook coverage — what fraction of outputs fall inside the playbook vs novel
Redline acceptance rate from counterparties
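To make the first of these metrics concrete, here is one common way to check calibration: group outputs into confidence bands and compare the stated confidence in each band with how often those outputs turned out to be correct. This is a generic sketch, not Flank's report code. The input format and the function name `calibration_table` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibration_table(
    outcomes: Iterable[Tuple[float, bool]],
) -> Dict[int, Tuple[float, float, int]]:
    """Group (confidence, was_correct) pairs into 10-point confidence bands.

    Returns {band_start_percent: (mean_confidence, observed_accuracy, count)}.
    """
    bands = defaultdict(list)
    for confidence, was_correct in outcomes:
        # Small epsilon guards against float artefacts such as 0.29 * 100 = 28.99...
        percent = int(confidence * 100 + 1e-9)
        band_start = min(percent // 10, 9) * 10  # 100% joins the 90-100 band
        bands[band_start].append((confidence, was_correct))

    table = {}
    for band_start in sorted(bands):
        rows = bands[band_start]
        mean_confidence = sum(c for c, _ in rows) / len(rows)
        observed_accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[band_start] = (mean_confidence, observed_accuracy, len(rows))
    return table


if __name__ == "__main__":
    sample = [(0.92, True), (0.95, True), (0.91, False), (0.98, True)]
    for band, (stated, observed, n) in calibration_table(sample).items():
        print(f"{band}-{band + 10}%: stated {stated:.2f}, observed {observed:.2f}, n={n}")
```

A well-calibrated agent shows stated and observed values close together in every band. In the toy sample, the 90-100% band is overconfident: it claims about 0.94 and delivers 0.75.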

Supervision-side metrics

Supervisor override rate — how often humans disagree with the agent
Median review time per flagged output
Escalation-to-resolution latency
Drift markers — are override rates trending up or down over time?
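The supervision-side numbers are simple to compute once each review is logged. The sketch below shows one possible definition of each. The four-week comparison window for drift and the function names are assumptions for illustration, not the definitions used in Flank's reports.

```python
from statistics import median
from typing import Sequence


def override_rate(overrides: int, reviewed: int) -> float:
    """Share of reviewed outputs where the human disagreed with the agent."""
    if reviewed == 0:
        return 0.0
    return overrides / reviewed


def median_review_minutes(review_minutes: Sequence[float]) -> float:
    """Median time a supervisor spent per flagged output."""
    return median(review_minutes)


def override_drift(weekly_rates: Sequence[float], window: int = 4) -> float:
    """Mean override rate of the last `window` weeks minus the window before.

    Negative means overrides are trending down; positive means trending up.
    """
    if len(weekly_rates) < 2 * window:
        raise ValueError("need at least two full windows of weekly rates")
    recent = weekly_rates[-window:]
    prior = weekly_rates[-2 * window:-window]
    return sum(recent) / window - sum(prior) / window


if __name__ == "__main__":
    rates = [0.30, 0.26, 0.22, 0.18, 0.12, 0.10, 0.08, 0.06]
    print(f"drift over last 4 weeks: {override_drift(rates):+.2f}")
```

A rising override rate after a period of decline is the signal to look at: it usually points to a new contract type, a changed counterparty pattern, or a playbook that has fallen behind.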

Supervision quality report — sample, week 8 of deployment

Illustrative figures for a mid-market SaaS contracting flow (NDAs, MSAs, DPAs, vendor T&Cs)

Every Flank deployment produces a weekly report along these lines. Figures shown are representative of a well-tuned week-8 deployment — real reports include richer drill-downs (confidence calibration by contract type, override patterns by supervisor, latency distributions, and drift markers against the prior eight weeks).

08 — The transformation

Day one vs. month three

The clearest way to describe the supervision experience is to contrast it with itself. Here's what the same legal team looks like at two points in the same deployment.

Day one

Supervisor reviews ~90% of outputs end-to-end
Playbook captures the team's most common positions but has gaps
Escalation thresholds intentionally conservative
Confidence calibration still being verified against human ground truth
Weekly review meeting with Flank is dense — lots of playbook edits
Time spent per contract: comparable to doing the work manually
Team still sceptical; waiting for evidence

Month three

Supervisor reviews <5% of outputs — only genuine exceptions
Playbook is well-developed, covering edge cases as they've been resolved
Escalation thresholds tuned to the team's actual risk tolerance
Confidence scores well-calibrated against months of override data
Weekly review is light — mostly metrics and drift checks
Time spent per contract: minutes, for the small fraction that needs it
Team is using freed time for work that was previously deferred

This is not an aspirational before-and-after. It's the pattern we see across deployments, roughly on this timeline, when the setup is done properly and the supervisor is engaged in the first month. The deployments that don't resolve like this are the ones where the playbook never gets tuned, or the thresholds are set once and never revisited, or supervision is treated as a one-off project instead of a weekly discipline.

What to ask your vendor

If you're evaluating Flank or any other agentic legal service, don't ask "how accurate is it?" — ask "what does supervision look like in week one, week four, and month three, and can you show me the data for a customer at each stage?" The answer will tell you whether you're buying a tool or a service, and whether the curve actually exists in their deployments.