Product deep dive

The supervision experience

What legal teams actually do — and don't do — when they supervise Flank. A close look at the risk-based approval workflow, confidence scoring, and the curve from day one to a mature supervision setup.

Type: Product deep dive
Reading time: 15 min
Audience: Legal ops, GCs, evaluators

01 — The premise

What supervision actually means

"Human-in-the-loop" has become a marketing phrase. What we mean by supervision is narrower and more specific: a lawyer or legal ops professional retains final authority over every piece of work the agent produces, but only intervenes where the agent is uncertain, the matter is novel, or the policy says so.

The word does a lot of work in this market, so it's worth being precise. There are at least three things people mean when they say "supervised AI" — and only one of them is what Flank is.

Not this
Pass-through
The AI drafts, a human clicks "approve" without reading. Supervision in name only. This is how most GenAI gets deployed, and it's how most hallucinations reach production.
Not this either
Line-by-line review
A lawyer reads every word the AI produces before it goes out. Safe, but takes almost as long as doing the work. This is where most AI contract tools leave you — and why pilots stall.
This
Exception-based supervision
The agent resolves what it's confident about, escalates what it isn't, and the supervisor focuses attention on the exceptions. The bulk of the work passes through without a human touching it — by design, and only after the system has earned that trust.
A fair concern

"If a supervisor is reading every redline before it goes out, how is this different from me doing the work myself?" Short answer: in week one, it isn't very different. The point is what happens in weeks four, twelve, and twenty-four — and whether the system is designed to get you there.

02 — The work

The four things a supervisor does

In a mature Flank deployment, a supervisor's job collapses to four discrete activities. Everything else the agent handles itself.

The Flank supervision dashboard in its full-browser view. A sidebar lists Agents, Tasks, Activity, and Supervision under a Flank.ai workspace. The middle column is a unified inbox of items from Email, MS Teams, the Flank app, Salesforce, and Helpdesk — GDPR queries, NDA review requests, sales-team questions, contract drafts. The right-hand columns show the conversation with William Smith about an NDA and a Review side panel with the tabs Non-compliant (28), Compliant (4), and Flank Flags (2); four clauses are listed with checkboxes under 'Awaiting Approval,' and a primary action at the bottom reads 'Approve selected changes.'
The full product in one frame. Work arrives from every channel the business already uses (email, MS Teams, Salesforce, Helpdesk, the Flank app) into a single supervisor queue. The agent picks up each item and drafts a response; the Review side panel groups its redlines into three categories (non-compliant with the playbook, compliant, and proactive Flank flags to the counterparty) and lets the supervisor approve them as a batch or clause by clause, or dismiss them.
Approve exceptions
The agent flags a clause where its confidence is below threshold, or where the counterparty's position crosses a policy line. The supervisor reviews the flagged item — not the whole document — and either approves, edits, or escalates.
Resolve novelty
A new contract type, a new jurisdiction, a counterparty pattern the agent hasn't seen before. The supervisor makes the call, and the decision is captured as playbook data for next time.
Tune the playbook
Every approval, override, and escalation is a signal. The supervisor reviews patterns weekly — what's getting escalated that shouldn't, what's passing through that should be caught — and updates fallback positions, risk thresholds, and escalation rules.
Handle the hard stuff
Strategic negotiations, bet-the-company clauses, internal stakeholder disputes. The agent is explicitly instructed to route these to a human — and to stay out of the way until the human is ready to hand back.

Notice what's not on this list: drafting, reviewing full documents, chasing the business for information, logging outcomes, following up on signatures. The agent does all of that. A supervisor's attention is the scarcest resource in the system, and the system is built to spend it only where it matters.

03 — The curve

Week by week: from day one to mature deployment

The most honest thing we can say about supervision is that it is heavier at the start than it will be later. Prospects who expect day-one autonomy are disappointed. Prospects who brace for a permanent line-by-line review burden are surprised — usually by week four, sometimes earlier.

The supervision workload curve
% of agent outputs a supervisor reviews end-to-end, by deployment week
[Chart: share of agent outputs reviewed end-to-end by a supervisor, plotted against time since deployment. The share falls from approximately 90% in week 1 to about 40% in week 3, 10% in week 8, and under 5% from month 3 onward, across four phases: calibration, narrowing, exception-based, and steady state. A dashed baseline at 100% marks the equivalent of doing the work manually.]
Indicative curve based on typical mid-market NDA / MSA flows across Flank deployments. Higher-stakes commercial matters and regulated workflows converge more slowly. Y-axis is the share of outputs reviewed end-to-end by a supervisor — exception flags, playbook tuning, and strategic escalations are handled separately and sit outside this measure.
Week 1 — Calibration
Nearly every output is reviewed. The supervisor is teaching the system what "good" looks like for this team's templates, tolerances, and edge cases. Expect to spend roughly the same time you would have spent doing the work yourself. This is the price of admission, not the steady state.
Weeks 2–3 — Narrowing
Confidence scores start clustering higher on known contract types. The supervisor stops reviewing full documents and starts reviewing flagged sections. Review time per contract drops by roughly half. The agent begins handling recurring fallback positions without escalation.
Weeks 4–8 — Exception-based
The shape of the work changes. Most contracts pass through without a supervisor touching them. The supervisor's queue becomes a short list of genuine exceptions — novel counterparty positions, edge cases, internal escalations. Time per contract drops to minutes, not hours.
Month 3+ — Steady state
The supervisor is no longer a bottleneck in the workflow. Their role shifts from reviewing individual outputs to tuning the system — updating playbooks, adjusting thresholds, reviewing weekly metrics. A team that supervised twenty contracts a week is now supervising a handful of exceptions and spending the rest of their time on work that requires judgement.
  • ~90% of outputs reviewed in week 1
  • ~40% reviewed by week 3
  • ~10% reviewed by month 2
  • <5% by month 3+

Figures are typical for a mid-market NDA / MSA flow across Flank customers. Higher-stakes commercial matters and regulated workflows converge more slowly.

04 — The mechanism

Risk-based approval and confidence scoring

The curve in the previous section is not the result of the model getting smarter week over week. It's the result of a workflow that grades every output on two axes — confidence and risk — and uses the grade to decide what reaches a human.

Axis 1
Confidence
How sure is the agent about the output? Derived from the model's own signals, semantic agreement across redundant generations, and retrieval match strength against your playbook. Expressed as a continuous score, not a binary.
Axis 2
Risk
How much does it matter if this output is wrong? Set by the customer at policy level — clause type, contract value, counterparty tier, jurisdiction. The agent inherits this from the playbook, not from its own judgement.

Confidence alone isn't enough. A high-confidence output on a bet-the-company indemnity still warrants a human look; a medium-confidence output on a standard governing-law clause does not. The two axes combine into a risk-based approval matrix that decides, per output, whether it auto-approves, escalates to the agent's own second pass, flags for supervisor review, or goes straight to a named escalation.

In the product, every flagged clause carries a visible reason. The agent surfaces the playbook position it's comparing against ("The playbook maximum is 12 months"), drafts the proposed redline with a diff view (strikethrough in red, replacement in green), and reduces the human decision to a single dismiss / confirm choice — or, where confidence is high enough to justify it, handles the clause silently and reports it in the auto-confirmed lane.

The Flank Review side panel on a mutual NDA. Tabs at the top show Needs review (1), Auto-confirmed (9), and Flank flags (0). The single item needing review is Non-compete Clause 7, titled 'Excessive Non-Compete Scope and Duration,' with an explanation that the clause prohibits competing in any business line for 24 months while the playbook maximum is 12 months limited to directly competing activities in the same market segment. A Suggested change section shows a redline: 'twenty-four (24)' struck through and replaced with 'twelve (12) months,' and 'directly or indirectly engage in any business line that competes with the other party' struck through and replaced with 'directly engage in activities that directly compete with the other party in the same market segment as the Purpose defined herein.' Comments count is 1. Dismiss and Confirm buttons sit at the bottom.
The confidence + risk matrix made visible at the clause level. On this mutual NDA the agent handled nine clauses on its own and surfaced one — an over-wide non-compete that violates the playbook on both duration and scope — with the reasoning spelled out and the suggested redline already drafted.
The ratio that matters

The Review panel reads "Needs review 1 · Auto-confirmed 9 · Flank flags 0." Nine clauses out of ten passed through without a supervisor touching them. The one that didn't arrives pre-diagnosed and pre-drafted — all the human has to do is look, think, and click. That ratio, and the shape of what reaches the queue, is the whole product.

How the thresholds are set

The thresholds that govern "what gets escalated" are not set by Flank. They're set by the customer during sprint zero and tuned across the first month of deployment. A conservative team can dial up escalation on day one, then relax it as trust builds. A more aggressive team can run with lower auto-approval thresholds from the start. Either way, the supervisor is in control of the tap.

Risk tier | Typical threshold | Who sees it
Low (standard clauses) | Auto-approve if confidence > 90% | No human unless flagged
Medium (fallback positions) | Auto-approve if confidence > 95% and inside playbook | Legal ops spot-check
High (commercial terms) | Always flag for review | Supervising lawyer
Critical (custom / bet-the-company) | Never auto-handled | Named escalation
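
To make the table concrete, here is a minimal sketch of how the two axes could combine into a routing decision. It is an illustration under stated assumptions, not Flank's implementation: the names `RiskTier`, `Route`, `ClauseOutput`, and `route` are hypothetical, the thresholds are the typical values from the table, and the sketch assumes that a low- or medium-risk output that misses its threshold goes to a supervisor. The agent's own second pass and the legal ops spot-check are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    LOW = "low"            # standard clauses
    MEDIUM = "medium"      # fallback positions
    HIGH = "high"          # commercial terms
    CRITICAL = "critical"  # custom / bet-the-company


class Route(Enum):
    AUTO_APPROVE = "auto-approve"
    SUPERVISOR_REVIEW = "flag for supervisor review"
    NAMED_ESCALATION = "named escalation"


@dataclass(frozen=True)
class ClauseOutput:
    risk_tier: RiskTier     # set by customer policy, not by the agent
    confidence: float       # continuous score in [0, 1]
    inside_playbook: bool   # does the output stay within playbook positions?


# Typical thresholds from the table; a real deployment would tune these.
AUTO_APPROVE_THRESHOLD = {RiskTier.LOW: 0.90, RiskTier.MEDIUM: 0.95}


def route(output: ClauseOutput) -> Route:
    """Decide where a single clause-level output goes."""
    if output.risk_tier is RiskTier.CRITICAL:
        return Route.NAMED_ESCALATION      # never auto-handled
    if output.risk_tier is RiskTier.HIGH:
        return Route.SUPERVISOR_REVIEW     # always flagged, whatever the confidence

    confident = output.confidence > AUTO_APPROVE_THRESHOLD[output.risk_tier]
    if output.risk_tier is RiskTier.MEDIUM:
        confident = confident and output.inside_playbook

    # Assumption: anything that misses its threshold is flagged for a supervisor.
    return Route.AUTO_APPROVE if confident else Route.SUPERVISOR_REVIEW
```

Under these assumptions, a high-confidence indemnity in the critical tier still goes to a named escalation, while a standard clause scored at 0.93 passes through untouched. Risk is checked before confidence, so no score can override a policy-level rule.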
05 — The distinction

Line-by-line review is a symptom, not the job

One of the most common things we hear from teams mid-pilot is some version of: "my supervisors are doing line-by-line review that takes almost as long as doing the work themselves." Almost always, this is a signal that the setup is still in its first three weeks — and that the playbook, thresholds, or escalation rules need tuning, not that supervision itself is broken.

Day one
Line-by-line review
The supervisor reviews every clause because the system hasn't yet earned the right to pass anything through without a check. Workload is roughly equivalent to doing the work yourself. This is the bootstrapping period — necessary, finite, and the only way to get to the next stage.
Mature deployment
Exception-based supervision
The supervisor reviews only what the system can't confidently resolve, or what policy requires a human to see. Most contracts pass through untouched. The supervisor's time is spent on genuine judgement calls, not a rubber-stamp exercise on standard work.
The Review side panel in use. The supervisor works through the "Needs review" queue on a real MSA — approving the proposed redlines where the playbook supports the change, editing where it doesn't, dismissing false positives. Auto-confirmed clauses and proactive Flank flags sit in their own lanes alongside. By the end of the clip, the queue has emptied and the contract is ready to send.

The discomfort in the first three weeks is real, and we don't hide it from prospects. The question any evaluator should ask isn't "how much supervision will I have to do?" — it's "is the system designed to reduce the supervision workload over time, and is there evidence it actually does?" The answer to both should be yes, with data to back it up.

The honest framing

If your supervision workload isn't dropping meaningfully between week one and week four, something is wrong with the setup, not with the concept: the playbooks are too thin, the thresholds are mistuned, or the work doesn't actually fit the agent yet. In every case, the fix is visible in the data, which is why we run a weekly review across the first eight weeks of every deployment.

06 — The roles

Who supervises what

Supervision is not a single job. In most deployments, the work splits across three roles that already exist in the legal team — lawyers, legal ops, and legal admin. The agent routes each class of exception to the role best placed to handle it.

Role | Supervises | Typical time commitment
Lawyer | Commercial terms, legal exceptions, novel counterparty positions, escalations that touch risk or strategy. | ~30 min/day at steady state. Concentrated on high-value judgement calls, not volume.
Legal ops | Playbook maintenance, threshold tuning, weekly metrics review, exception patterns, integration health. | ~2–3 hours/week. Treats the agent like any other operational system under their remit.
Legal admin / paralegal | Low-risk procedural checks, counterparty data validation, signature routing, internal stakeholder follow-ups the agent flags for a human touch. | ~1 hour/day. Much of this work simply wasn't happening reliably before.

The important point: no role is doing work beneath its seniority. The lawyer is not reading NDAs line by line. The legal ops lead is not chasing counterparty signatures. The admin is not making commercial judgement calls. This separation is the whole point, and it's what collapses when teams try to wrap a base model themselves and end up with one person approving everything the model produces.

07 — The standard

What "supervised to near-zero error" looks like

Accuracy metrics in isolation mislead. A 93% accurate agent sounds worrying; a 93% accurate agent behind a well-tuned supervision layer is, in production, closer to 99.5% — because the 7% the agent gets wrong is disproportionately the 7% it's least confident about, which is disproportionately what gets escalated to a human.
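
The arithmetic behind that claim can be sketched in a few lines. This is a simplified model rather than a measured result: it assumes every agent error that reaches a supervisor is corrected, and that supervisors introduce no new errors. The function names and the `error_catch_rate` parameter (the share of the agent's errors that end up escalated) are hypothetical.

```python
def system_accuracy(raw_accuracy: float, error_catch_rate: float) -> float:
    """Accuracy after supervision, given the share of agent errors that are escalated.

    Assumes escalated errors are always fixed and no new errors are introduced.
    """
    residual_errors = (1.0 - raw_accuracy) * (1.0 - error_catch_rate)
    return 1.0 - residual_errors


def required_catch_rate(raw_accuracy: float, target_accuracy: float) -> float:
    """Share of agent errors that must be escalated to reach the target accuracy."""
    raw_error = 1.0 - raw_accuracy
    if raw_error <= 0.0:
        return 0.0  # a perfect agent needs no escalation
    allowed_error = 1.0 - target_accuracy
    return max(0.0, 1.0 - allowed_error / raw_error)


print(f"{system_accuracy(0.93, 0.93):.4f}")       # prints 0.9951
print(f"{required_catch_rate(0.93, 0.995):.3f}")  # prints 0.929
```

In this model, a 93% accurate agent reaches roughly 99.5% only if about 93% of its errors land in the escalated set. That is why confidence calibration matters as much as raw accuracy.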

The right framing

Supervision is not a tax you pay on agent inaccuracy. It's the mechanism that converts model accuracy into system reliability. The raw model output is the floor. The supervised output is the ceiling. The gap between the two is where the product actually lives.

What we track, and what we share with customers

Every supervised action is logged against the counterfactual — would a human reviewer have caught what the agent caught, would they have flagged what the agent flagged, would they have drafted the redline the agent drafted. Every week of a deployment produces a supervision quality report covering:

Agent-side metrics
  • Confidence calibration — does 90% confidence actually mean 90% correct?
  • Escalation rate by contract type and risk tier
  • Playbook coverage — what fraction of outputs fall inside the playbook vs novel
  • Redline acceptance rate from counterparties
Supervision-side metrics
  • Supervisor override rate — how often humans disagree with the agent
  • Median review time per flagged output
  • Escalation-to-resolution latency
  • Drift markers — are override rates trending up or down over time?
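
Two of these metrics are simple to compute from logged decisions. The sketch below is illustrative only: the record shapes and the names `calibration_table` and `override_rate` are assumptions, not Flank's reporting schema. It bins outputs by stated confidence and compares each bin's mean confidence with its observed accuracy, then computes the share of flagged outputs a supervisor overrode.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Compare stated confidence with observed correctness, bin by bin.

    records: iterable of (stated_confidence in [0, 1], was_correct as bool).
    Returns one dict per non-empty bin, ordered by confidence.
    """
    bins = defaultdict(list)
    for confidence, was_correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 falls in the top bin
        bins[index].append((confidence, was_correct))

    table = []
    for index in sorted(bins):
        rows = bins[index]
        table.append({
            "bin": (index / n_bins, (index + 1) / n_bins),
            "mean_confidence": sum(c for c, _ in rows) / len(rows),
            "observed_accuracy": sum(1 for _, ok in rows if ok) / len(rows),
            "count": len(rows),
        })
    return table


def override_rate(decisions):
    """Share of flagged outputs where the supervisor disagreed with the agent.

    decisions: sequence of bools, True when the supervisor overrode the agent.
    """
    if not decisions:
        return 0.0
    return sum(decisions) / len(decisions)
```

A well-calibrated agent shows `mean_confidence` close to `observed_accuracy` in every bin. Tracking `override_rate` week by week gives a simple drift marker.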
Supervision quality report — sample, week 8 of deployment
Illustrative figures for a mid-market SaaS contracting flow (NDAs, MSAs, DPAs, vendor T&Cs)
[Dashboard: sample supervision quality report with four panels. (1) Confidence calibration: stated confidence vs. actual correctness over the last 7 days, plotted against a perfect-calibration diagonal. (2) Supervisor override rate: share of flagged outputs the supervisor disagreed with, declining from 18% to 4% between week 1 and week 8. (3) Escalation rate by contract type, week 8: NDAs 3%, MSAs 8%, DPAs 12%, vendor T&Cs 6%. (4) Headline ratios: 87% counterparty redline acceptance across all suggested changes sent out; 4.2 min median supervisor review time per flagged clause; 9:1 auto-confirmed to needs-review at clause level across all contracts this week.]
Every Flank deployment produces a weekly report along these lines. Figures shown are representative of a well-tuned week-8 deployment — real reports include richer drill-downs (confidence calibration by contract type, override patterns by supervisor, latency distributions, and drift markers against the prior eight weeks).
08 — The transformation

Day one vs. month three

The clearest way to describe the supervision experience is to contrast it with itself. Here's what the same legal team looks like at two points in the same deployment.

Day one
  • Supervisor reviews ~90% of outputs end-to-end
  • Playbook captures the team's most common positions but has gaps
  • Escalation thresholds intentionally conservative
  • Confidence calibration still being verified against human ground truth
  • Weekly review meeting with Flank is dense — lots of playbook edits
  • Time spent per contract: comparable to doing the work manually
  • Team still sceptical; waiting for evidence
Month three
  • Supervisor reviews <5% of outputs — only genuine exceptions
  • Playbook is well-developed, covering edge cases as they've been resolved
  • Escalation thresholds tuned to the team's actual risk tolerance
  • Confidence scores well-calibrated against months of override data
  • Weekly review is light — mostly metrics and drift checks
  • Time spent per contract: minutes, for the small fraction that needs it
  • Team is using freed time for work that was previously deferred

This is not an aspirational before-and-after. It's the pattern we see across deployments, roughly on this timeline, when the setup is done properly and the supervisor is engaged in the first month. The deployments that don't resolve like this are the ones where the playbook never gets tuned, or the thresholds are set once and never revisited, or supervision is treated as a one-off project instead of a weekly discipline.

What to ask your vendor

If you're evaluating Flank or any other agentic legal service, don't ask "how accurate is it?" — ask "what does supervision look like in week one, week four, and month three, and can you show me the data for a customer at each stage?" The answer will tell you whether you're buying a tool or a service, and whether the curve actually exists in their deployments.
