Flank research

The GC's guide to evaluating agentic legal services

A framework for enterprise legal leaders assessing whether autonomous agents belong in their team's operating model, and how to evaluate providers without getting lost in the hype.

Reading time: 12 minutes
Published: April 2026
Audience: GCs, CLOs, Heads of Legal Ops

01 — The category is real, but the language is not yet settled

What "agentic" actually means in legal

Every vendor in legal AI claims to be "agentic" in 2026, and the term is being diluted fast. But the underlying category shift is genuine and worth understanding before you evaluate any provider.

The simplest way to think about it: a copilot assists a lawyer who is doing the work. An agent does the work, with the lawyer setting rules and supervising the output. The test is practical. Does the lawyer still need to open a tool, paste in text, and process the result? If yes, you have a copilot. Does the work arrive in the lawyer's supervision queue already completed? That is an agent.

This distinction matters for a specific reason. Copilots make individual lawyers faster, typically by 20 to 30 percent on a given task. That is valuable, but it does not change the staffing model. Every contract still requires a human to pick it up, process it, and send it back. The throughput ceiling is still bounded by how many lawyers you have.

Agents change the model itself. Routine work is removed from the human queue entirely. The lawyer's role shifts from executing to supervising. The throughput ceiling moves from headcount to agent capacity, which is elastic. This is not a productivity improvement. It is a structural change in how legal work gets delivered.

Whether that structural change is right for your team depends on what kind of work you do, how much of it is routine, and whether your current model is actually breaking. If a dedicated three-person team handles 200 NDAs a month and the backlog is still growing, you are a candidate. If your team does primarily bespoke advisory work with occasional contract reviews, you probably are not.
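The throughput argument above can be put into rough arithmetic. The sketch below is illustrative only: the names Team, copilot_ceiling and human_reviews_with_agent are hypothetical, and the per-lawyer capacity and exception rate are assumptions you would replace with your own figures.

```python
from dataclasses import dataclass


@dataclass
class Team:
    lawyers: int
    items_per_lawyer: float  # assumption: items one lawyer completes per month unaided


def copilot_ceiling(team: Team, speedup: float) -> float:
    """Monthly ceiling when every item still passes through a lawyer.

    speedup is the fractional gain from the copilot, e.g. 0.3 for 30 percent.
    """
    return team.lawyers * team.items_per_lawyer * (1 + speedup)


def human_reviews_with_agent(volume: float, exception_rate: float) -> float:
    """Items per month that still reach a lawyer when agents complete the rest.

    exception_rate is an assumed share of output escalated for human review.
    """
    return volume * exception_rate
```

In the copilot case the ceiling rises only when headcount rises. In the agent case the human load depends on the share of work escalated as exceptions, not on total volume handled by lawyers.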

The gap that matters

An ACC/Everlaw survey found that 64% of in-house teams expect to reduce outside counsel reliance through AI. But only 7% report actually seeing a reduction in total matter cost. That gap between expectation and reality is the entire argument for why copilots alone are insufficient and why the market is moving toward agents.

02 — Four tests for evaluating providers

What to ask and what the answers mean

If you are evaluating an agentic legal services provider, there are four questions that cut through positioning language and reveal what the product actually does. Any vendor worth considering should be able to answer all four clearly.

Test 1: Does it execute autonomously under guardrails?

The question is whether the system completes work from intake to resolution without requiring human approval at every step. Some platforms chain tasks together but still require a lawyer to click "approve" between each stage. That is task automation with a human in the middle, not autonomous execution. True autonomous execution means the agent receives a request, applies rules, produces output, and delivers it. The human sets the rules and reviews exceptions, not every output.

Test 2: Does it operate on your own legal logic?

Does the system use your specific templates, your preferred terms, your fallback positions, your escalation rules? Or does it rely on generic training data and require your team to learn prompt engineering? The difference is significant. A platform that operates on generic legal knowledge is a research assistant. A platform that operates on your institutional knowledge is an agent that works like a member of your team.

Test 3: Can a single agent perform multiple functions in a workflow?

Some vendors sell "agents" that perform a single function. One agent for triage. Another for review. Another for drafting. That is a point-solution suite, not an agentic system. The test is whether a single agent can triage a request, select the right template, draft the document, apply jurisdiction-specific clauses, and route the output, all as a connected workflow.

Test 4: Is legal supervision built into the product?

This is the test most buyers undervalue and most vendors fail. Supervision is not "we have an audit log." Supervision means tenured legal practitioners review agent output, refine the underlying rules as patterns emerge, and close gaps between what the agent does and what your team's standard requires. It is quality control as a product feature, not an afterthought.

Ask the vendor: who supervises the agent's work? If the answer is "your team," that might be fine if you have capacity. If the answer is "nobody, the AI is accurate enough," walk away. If the answer includes a defined supervision model with options for who handles it and how quality is measured, you are in the right conversation.
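If you are comparing several vendors, it can help to record the four tests in a consistent structure. The following is a minimal sketch, not a scoring standard: the class names, field names and verdict labels are all hypothetical, and the only hard rule encoded is the one stated above (walk away if nobody supervises).

```python
from dataclasses import dataclass
from enum import Enum


class Supervisor(Enum):
    YOUR_TEAM = "your team"
    NOBODY = "nobody, the AI is accurate enough"
    DEFINED_MODEL = "defined supervision model"


@dataclass
class VendorAssessment:
    autonomous_under_guardrails: bool  # Test 1
    runs_on_your_legal_logic: bool     # Test 2
    one_agent_covers_workflow: bool    # Test 3
    supervisor: Supervisor             # Test 4
    team_can_supervise: bool = False   # only relevant when supervisor is YOUR_TEAM


def open_concerns(a: VendorAssessment) -> list[str]:
    """List the tests that need a follow-up conversation with the vendor."""
    concerns = []
    if not a.autonomous_under_guardrails:
        concerns.append("Test 1: human approval required between stages")
    if not a.runs_on_your_legal_logic:
        concerns.append("Test 2: relies on generic legal knowledge")
    if not a.one_agent_covers_workflow:
        concerns.append("Test 3: single-function agents sold as a suite")
    if a.supervisor is Supervisor.YOUR_TEAM and not a.team_can_supervise:
        concerns.append("Test 4: supervision falls on a team without capacity")
    return concerns


def verdict(a: VendorAssessment) -> str:
    """Return a hypothetical verdict label for one vendor."""
    if a.supervisor is Supervisor.NOBODY:
        return "walk away"
    return "probe further" if open_concerns(a) else "proceed to pilot"
```

The value of writing it down this way is less the verdict than the list of concerns, which gives you the agenda for the next vendor call.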

A practical shortcut

Send the vendor a real NDA from your business, with your standard terms and your counterparty's redlines. Ask them to show you what the agent produces. The output will tell you more than any demo or slide deck. Does it look like what your team would produce? Does it apply your positions correctly? Does the supervision queue flag the right exceptions?

03 — The delivery model question

Own it or outsource it

Agentic legal services come in two structural forms, and the choice between them is more important than any feature comparison.

The first form is what you might call the platform model. The vendor deploys agents inside your environment. The agents run on your playbooks, your templates, your escalation rules. Your team (or a partner firm, or the vendor's supervision team) supervises the output. The capability belongs to you. If you change vendors, the playbooks and institutional knowledge stay.

The second form is the service model. The vendor employs the agents (and often human lawyers) and delivers completed work. You send contracts in, you get work back. The vendor handles everything: the AI, the supervision, the quality control. From your perspective, it looks like hiring a very fast, very consistent ALSP. The capability lives inside the vendor's organisation.

Neither model is inherently superior. The platform model gives you more control and builds internal capability over time. The service model requires less change and less internal capacity. The right choice depends on your operating posture.

If your team has the capacity to supervise and wants to own the playbook logic, the platform model is usually the better fit. If your team is already stretched and wants the work done without adding internal process, the service model may be more practical. Some buyers start with the service model and migrate to the platform model as they build confidence.

Dimension | Platform model | Service model
Where capability lives | Inside your team | Inside the vendor
Who supervises | Your lawyers (or your chosen partner) | The vendor's lawyers
Knowledge ownership | Playbooks belong to you. Portable | Knowledge compounds inside vendor. Less portable
Change management | Some: playbook definition, supervision setup | Minimal: send work, receive output
Control over quality | Direct: you define and enforce standards | Indirect: you rely on vendor's quality model
Switching cost | Lower: you own the configuration | Higher: operational knowledge locked in vendor

04 — Running a pilot

How to test without committing

The most common mistake in evaluating agentic legal services is trying to evaluate the entire category theoretically. A pilot on a single, bounded workflow tells you more than months of vendor presentations.

The ideal pilot scope is a high-volume, low-risk workflow. NDAs are the canonical example: high volume, standard terms, clear playbook, measurable turnaround. Other good candidates include vendor agreement reviews, first-pass review of procurement contracts, and intake triage.

A well-designed pilot should run for 4 to 6 weeks and measure three things. First, accuracy: does the agent's output match what your team would produce, on your templates, with your positions? Second, turnaround: what is the time from request to delivery, and how does it compare to your current process? Third, supervision load: how much time does your team spend reviewing and correcting agent output, and does that load decrease over time as the playbooks improve?
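The three measurements above can be tracked with a very small log of pilot items. The sketch below assumes you record, for each item, the pilot week, whether a reviewer judged the output equivalent to your team's own work, the turnaround, and the supervisor's review time. The names PilotItem and summarize_pilot and the field names are hypothetical, and the baseline turnaround is whatever your current process measures.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PilotItem:
    week: int                  # pilot week, starting at 1
    matches_team_output: bool  # reviewer judged it equivalent to the team's own work
    turnaround_hours: float    # request to delivery
    review_minutes: float      # supervisor time spent reviewing and correcting


def summarize_pilot(items: list[PilotItem], baseline_turnaround_hours: float) -> dict:
    """Summarize accuracy, turnaround and supervision load for a pilot."""
    if not items:
        raise ValueError("no pilot items recorded")
    accuracy = sum(i.matches_team_output for i in items) / len(items)
    agent_turnaround = median(i.turnaround_hours for i in items)
    weeks = sorted({i.week for i in items})
    # Average review time per item, week by week, to see whether the load is falling.
    weekly_review = [
        sum(i.review_minutes for i in items if i.week == w)
        / sum(1 for i in items if i.week == w)
        for w in weeks
    ]
    return {
        "accuracy": accuracy,
        "median_turnaround_hours": agent_turnaround,
        "turnaround_vs_baseline": agent_turnaround / baseline_turnaround_hours,
        "weekly_review_minutes": dict(zip(weeks, weekly_review)),
        "review_load_falling": all(
            later <= earlier for earlier, later in zip(weekly_review, weekly_review[1:])
        ),
    }
```

The median is used for turnaround so that one stalled item does not distort the picture. Whether a strictly falling weekly review load is the right bar is a judgment call, and a looser trend test may suit a short pilot better.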

Resist the temptation to pilot on your most complex workflow. The point of the pilot is not to find the limits of the technology. It is to prove the model works on the work that constitutes the majority of your team's volume. Start with the 80% that is routine. If that works, the conversation about the remaining 20% becomes much easier.

Pilot success criteria

A good pilot answers three questions. Does the output quality meet our standard? (Review a sample of agent outputs against what your team would produce.) Does the capacity model hold? (Can one supervisor cover the output volume that previously required multiple people?) Does the team trust it? (Are your lawyers comfortable with the supervision model, or are they redoing the agent's work?)

05 — What to ask about security and data

The procurement conversation

Any agentic legal services provider processes your contracts, your templates, and your counterparty data. The security conversation is not optional and should happen early, not after you have already decided to buy.

The questions that matter: is the architecture single-tenant or multi-tenant? (Single-tenant means your data is isolated by design, not by access control.) Where does the data reside, and can you specify the region? Which LLM providers process your data, and under what data processing agreements? Is your data used to train any model? What certifications does the vendor hold (SOC 2 Type II is the baseline for enterprise)? What audit trail exists for every agent action, and who can review it?
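One way to keep the procurement conversation honest is to track which of these questions the vendor has actually answered. The sketch below is a hypothetical record format, not a compliance tool: the class and field names are invented for illustration, and the only rule encoded is the SOC 2 Type II baseline mentioned above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SecurityAnswers:
    tenancy: Optional[str] = None                   # "single-tenant" or "multi-tenant"
    data_region: Optional[str] = None               # where data resides, and whether you can choose
    llm_providers_and_dpas: Optional[str] = None    # which providers, under which agreements
    data_used_for_training: Optional[bool] = None
    certifications: Optional[list[str]] = None
    audit_trail: Optional[str] = None               # what is logged for each agent action


def unanswered(a: SecurityAnswers) -> list[str]:
    """Return the names of the questions the vendor has not yet answered."""
    return [f.name for f in fields(a) if getattr(a, f.name) is None]


def meets_certification_baseline(a: SecurityAnswers) -> bool:
    """Check for the enterprise baseline named in the text (SOC 2 Type II)."""
    return "SOC 2 Type II" in (a.certifications or [])
```

An empty list from unanswered is a reasonable gate before the commercial discussion starts, though what counts as an acceptable answer remains a matter for your security and privacy teams.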

For EU-headquartered organisations, GDPR compliance is not a checkbox. Ask whether the vendor is EU-based or US-based, where data processing happens by default, whether a Data Processing Agreement is available, and whether the vendor has dealt with EU procurement processes before. A vendor that treats GDPR as an afterthought will create problems for your DPO.

06 — The budget conversation

Where the money comes from

One of the underappreciated aspects of agentic legal services is that the budget already exists. If your team outsources routine work to law firms or ALSPs, that spend is the displacement target. This is not a new software budget request. It is a reallocation of existing services spend.

The framing that resonates with CFOs: "service budget, software scale." The cost profile looks like outsourcing (variable, per-unit, no headcount approval needed). The scaling profile looks like software (capacity grows without proportional cost increase). For organisations spending significantly on outside counsel for routine contracting, the ROI case is straightforward because the baseline cost is high and well-documented.

If your organisation does not currently outsource routine legal work, the budget conversation is different. You are making the case for capacity expansion without headcount growth. The ROI is measured in throughput (how much more work the team handles) and lawyer reallocation (how much strategic work the team can now do). Both are real but harder to quantify than direct cost displacement.
