Flank research — technical analysis

Where MikeOSS falls short

A code-level gap analysis comparing the open source legal AI assistant against what an enterprise-grade legal AI platform actually has to do. Every claim cites a specific file, line, or architectural decision in the public MikeOSS repository.

Last updated 20 May 2026
Snapshot repo state at 20 May 2026
Classification Internal
Source github.com/willchen96/mike
01 — Why this analysis exists

The distance between a working demo and a production platform

Refreshed · 20 May 2026

Update: After a response shared by the repo owner, this teardown has been brought in-line with the latest version. The initial teardown was based on the initial release of Mike. On May 8th, a number of patches were released and they have been reviewed below. I would also love to draw attention to Scott Kveton's own exploration of Mike (and his related fork) at https://www.linkedin.com/pulse/missing-middle-legal-ai-scott-kveton-tw6gc/.

  • Gap 06 (rate limiting): Will's added express-rate-limit with sensible per-route caps (general, chat, chat-create, uploads) in backend/src/index.ts, and Helmet is now on too. Closed.
  • Gap 08 (hardcoded signing secret): the dev fallback is dead. If DOWNLOAD_SIGNING_SECRET isn't set, downloadTokens.ts refuses to boot. No more "dev-secret" or sneaky service-key fallback. Closed.
  • Gap 09 (service role key in the frontend): supabase-server.ts has been deleted. The service key only touches the backend now (backend/src/lib/supabase.ts, backend/src/middleware/auth.ts). The deeper RLS-coverage point still stands, but the frontend leak surface is gone. Closed.
  • Gap 10 (silent auth bypass): auth fails closed properly. Missing env vars throw a 500, bad tokens get a 401, and the old "raw token = user ID" footgun has been ripped out. Closed.
  • Gap 05 (raw stream log artifact): the debug log write is gone, which is the right move, but nothing structured has replaced it yet. So this one's reframed rather than closed.

For context, the items closed above are security-hygiene fixes. The broader architectural analysis below still describes the snapshot.

MikeOSS is, by Will Chen's own description, two weeks of work. As a clone of the chat-with-documents surface of Harvey and Legora, it is a credible artifact and a useful pressure point on legal AI pricing. It also highlights how thin the initial "talk to your docs" use cases actually were.

The point of this document here is narrower. It's a (rather rapid) audit of where the MikeOSS codebase falls short of what an enterprise-grade legal AI platform actually has to do, with every gap citing a specific file, line, or architectural decision in the public repository. We're not arguing MikeOSS shouldn't exist. The work it does at the chat-with-documents layer is real and it provides a solid, well designed, and thoughtfully architected starting point for self-build projects. Instead, the aim here is just to make the distance between a working chat-with-documents demo and a production-grade agentic legal system legible, because that distance is mostly invisible at the demo, which is what makes it expensive when an in-house team underestimates it.

Even if you're a huge fan of MikeOSS, this is for you, as this should help better surface the gaps for you to bridge when deploying Mike (or a fork of Mike) into production!

I've organised the thirteen gaps into four areas: architecture and execution model, multi-tenancy and governance, security posture, and operations. 

Any thoughts, questions, or challenges, please add me on LinkedIn.

The underlying question

The interesting question here is the delta: what would an in-house team need to build, harden and run on top of this codebase to get from where the repo sits today to something they would put in front of privileged matter work in a regulated organisation. The foundation models and surfaces inside MikeOSS aren't the variable; they're the same models and similar surfaces every vendor uses. The delta is. The rest of this piece walks through it.

02 — Architecture and execution model

The three structural decisions that bound what MikeOSS can do

Three findings here are about how MikeOSS executes work, and what kinds of work the underlying design forecloses. None of these is the sort of thing you fix by adding features; each would require changing the shape of the system itself.

I'm leading with these simply because they're the biggest delta. No doubt there's a fork out there already reworking some of this.

Gap 01No retrieval layer, context-stuffing only

MikeOSS has no vector store, no embeddings, no RAG pipeline, and no semantic search. Zero dependencies on pgvector, Pinecone, Weaviate, Chroma, or any embedding model. Documents are loaded in full via LLM tool calls. The system prompt explicitly forces re-reading on every turn:

You do NOT retain document content between conversation turns. You MUST call read_document (or fetch_documents) at the start of every response that involves a document's content, even if you have read it in a previous turn.
backend/src/lib/chatTools.ts:661 (20 May 2026 snapshot)

The only "search" is find_in_document, a case-insensitive substring match over raw text (chatTools.ts:288-320 in the 20 May snapshot). No relevance ranking, no fuzzy matching, no cross-document discovery.

What enterprise requires

A chunking and embedding pipeline (per-document and cross-corpus), vector similarity search for retrieval, hybrid search combining BM25 and dense retrieval, and citation grounding that maps back to source chunks. Without this, MikeOSS cannot handle large document sets, multi-document Q&A, or precedent search.

Gap 02Synchronous-only execution, no background jobs

Every LLM call runs in the HTTP request/response cycle. There is no job queue, no worker process, no async task infrastructure. The backend is a single Express process:

app.listen(PORT, () => { console.log(`Mike backend running on port ${PORT}`); });
backend/src/index.ts:124-126 (20 May 2026 snapshot)

No Bull, no BullMQ, no Redis, no task scheduler. When a user sends a chat message, the server blocks that request until the LLM finishes. This forecloses batch document processing (reviewing 200 contracts against a playbook), long-running agent workflows that outlive a browser session, scheduled or trigger-based agent execution, parallel multi-document analysis, and any operation that takes longer than a reasonable HTTP timeout.

What enterprise requires

A durable task queue (with retry, dead-letter, and priority), worker processes that can run independently of the API server, and a status/progress API so the frontend can poll or subscribe to job state. Flank uses Temporal for this, with workflows that survive process restarts.

Gap 03Three hardcoded workflows, no workflow engine

The entire workflow system consists of three static prompt templates:

export const BUILTIN_WORKFLOWS: { id: string; title: string; prompt_md: string }[] = [ { id: "builtin-cp-checklist", title: "Generate CP Checklist", ... }, { id: "builtin-credit-summary", title: "Credit Agreement Summary", ... }, { id: "builtin-sha-summary", title: "Shareholder Agreement Summary", ... }, ];
backend/src/lib/builtinWorkflows.ts:1-76

Workflows are strings pasted into the system prompt. There is no conditional logic, no branching, no multi-step orchestration, no loops, no human-in-the-loop approval gates, and no way to compose workflows from reusable steps. User-created workflows follow the same pattern. The "engine" is: prepend the workflow prompt to the system prompt, then run the normal chat loop.

What enterprise requires

A workflow engine with multi-step execution (step A's output feeds step B), conditional branching, parallel execution, human-in-the-loop gates, retry and error handling per step, and audit trails per execution. Legal workflows are inherently multi-step and approval-gated. A single-prompt approach cannot model them.

03 — Multi-tenancy and governance

The system was not designed for shared, supervised, auditable use

For a platform that handles privileged information across multiple matters, the absence of an organisation model and an audit trail isn't a minor omission. It tells you the system was designed as a single-user tool first.

Gap 04No multi-tenancy or organisation model

The database schema has no concept of organisations, teams, or tenants. Users exist as flat rows in user_profiles:

create table if not exists public.user_profiles ( id uuid primary key default gen_random_uuid(), user_id uuid not null unique references auth.users(id) on delete cascade, display_name text, organisation text, -- free-text field, not a foreign key tier text not null default 'Free', ... );
backend/schema.sql:12-25 (20 May 2026 snapshot; previously backend/migrations/000_one_shot_schema.sql)

The organisation field is free text with no constraints or relationships. No organisation tables, no team tables, no role assignments. Sharing is email-based via JSONB arrays (shared_with), and access checks match the current user's email against this array. There is no RBAC. The isOwner boolean is the only permission level.

What enterprise requires

Organisation-scoped data isolation (or per-tenant databases), role-based access (admin, member, viewer, external counsel), team-based sharing, SSO/SCIM provisioning, and audit trails. A legal platform handling privileged information across multiple clients needs strict data boundaries. Email-list sharing does not scale and cannot be audited.

Gap 05No audit trail or compliance logging

There is no audit log table, no event journal, no record of who accessed what document or what the AI generated. As of the 20 May 2026 snapshot the previous RAW_STREAM_LOG_PATH debug file write has been removed (the original cite at backend/src/lib/llm/claude.ts:83-86 no longer exists in the repo), which is a clear improvement. A structured audit system has not yet been added in its place. There is no audit log table in backend/schema.sql, no event journal, and no chain-of-custody record of what documents were accessed, what prompts were sent, or what edits the AI proposed and whether they were accepted.

What changed since the snapshot reviewed

The raw debug log artifact has been removed. The structural finding still stands: no immutable audit trail.

What enterprise requires

Immutable, structured audit logs capturing who initiated each action, what documents were accessed, what the AI generated, what edits were proposed and whether they were accepted, and when. Legal teams operating under professional privilege obligations need to demonstrate chain-of-custody for AI-assisted work product. Regulators increasingly require explainability for AI-generated legal output.

Gap 06No rate limiting or cost controlsClosed

The original finding was that no rate-limiting middleware sat in front of the LLM endpoints. In the 20 May 2026 snapshot this is addressed: express-rate-limit is wired in with tiered limiters for general traffic, chat, chat-create, and uploads, all configurable via environment variables. helmet has also been added for security headers.

const chatLimiter = makeLimiter({ windowMs: minutes(envInt("RATE_LIMIT_CHAT_WINDOW_MINUTES", 15)), max: envInt("RATE_LIMIT_CHAT_MAX", 30), message: "Too many chat requests. Please try again later.", }); // ... app.post("/chat", chatLimiter); app.post("/projects/:projectId/chat", chatLimiter); app.post("/tabular-review/:reviewId/chat", chatLimiter);
backend/src/index.ts

What remains open is the deeper finding underneath: extended thinking is still pinned to maximum effort (thinking: { type: "adaptive" }, output_config: { effort: "high" } in backend/src/lib/llm/claude.ts), message_credits_used in user_profiles is still a counter rather than a gate enforced before the LLM call, and there is no per-organisation token budget or cost attribution. The IP-based limiter prevents pathological volume; it does not yet provide cost governance.

What changed since the snapshot reviewed

Route-level rate limiting and Helmet security headers are now in place. The cost-governance dimension (per-org budgets, pre-call credit enforcement, configurable effort) is not yet addressed.

04 — Security posture

Five findings, several of which would not survive procurement

Some of what follows is typical of early-stage codebases. A few of these findings would prevent a serious procurement team from clearing the platform for production use without rewrites.

Gap 07No input validation

No schema validation library (zod, joi, or similar) is used anywhere. Route handlers accept raw req.body and pass it through:

// POST handler accepts body as-is const { messages, chat_id, project_id, model, ... } = req.body;
backend/src/routes/chat.ts

The shared_with array, columns_config JSONB, and messages array are all accepted without type or shape validation. A malformed payload will either crash deep in business logic or corrupt JSONB columns.

What enterprise requires

Schema validation at every API boundary. Malformed inputs to an LLM pipeline produce cascading failures that are hard to diagnose. In a multi-tenant system, input validation is also a security boundary. Untrusted input must never reach the database or LLM layer unvalidated.

Gap 08Hardcoded signing secret fallbackClosed

The original finding was that getSecret() silently fell back to the Supabase service-role key, and then to the literal string "dev-secret", if DOWNLOAD_SIGNING_SECRET was unset. As of the 20 May 2026 snapshot the function fails closed:

function getSecret(): string { const secret = process.env.DOWNLOAD_SIGNING_SECRET; if (!secret) { throw new Error( "DOWNLOAD_SIGNING_SECRET must be set. " + "Generate a strong random value (e.g. `openssl rand -hex 32`) and set it in the environment.", ); } return secret; }
backend/src/lib/downloadTokens.ts (20 May 2026 snapshot)
What changed since the snapshot reviewed

The fallback chain has been removed. A missing signing secret now throws at startup with a clear remediation message, which is exactly the fail-fast behaviour the original requirement called for.

Gap 09Service role key in the frontend codebaseClosed

The original finding cited frontend/src/lib/supabase-server.ts using SUPABASE_SECRET_KEY (the service-role key, which bypasses RLS) inside the Next.js source tree. As of the 20 May 2026 snapshot that file is gone. The frontend now ships frontend/src/lib/auth.ts, which validates Bearer tokens against Supabase using the publishable key (NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY) only. The service-role key is now confined to the backend in backend/src/lib/supabase.ts and backend/src/middleware/auth.ts.

What changed since the snapshot reviewed

Frontend exposure of the service-role key has been eliminated, which was the main risk in the original finding. One secondary point still stands: Row Level Security is still only applied to user_profiles, not to projects, documents, chats, workflows, or tabular_reviews. All access control on those tables is still enforced at the application layer only, which leaves the defence-in-depth gap open.

Gap 10Auth bypass in dev modeClosed

The original finding was that the frontend auth helper, when Supabase credentials were missing, silently accepted the raw Bearer token as the user ID. That meant a misconfigured production deploy would authenticate any string as any user. As of the 20 May 2026 snapshot the responsibility has moved to backend/src/middleware/auth.ts, which fails closed on every path:

if (!supabaseUrl || !serviceKey) { res.status(500).json({ detail: "Server auth is not configured" }); return; } const admin = createClient(supabaseUrl, serviceKey, { auth: { persistSession: false }, }); const { data } = await admin.auth.getUser(token); if (!data.user) { res.status(401).json({ detail: "Invalid or expired token" }); return; }
backend/src/middleware/auth.ts (20 May 2026 snapshot)
What changed since the snapshot reviewed

Missing credentials now produce a 500, invalid tokens produce a 401, and the "accept raw token as user ID" fallback is gone. This is the fail-closed behaviour the original requirement called for.

Gap 11No prompt injection defences

User messages are interpolated directly into LLM prompts without isolation. The title generation endpoint:

user: `Generate a concise title...Message: ${message.slice(0, 500)}`
backend/src/routes/chat.ts:404 (20 May 2026 snapshot)

Document context is built by appending user-supplied filenames and content into the system prompt. There are no isolation markers, no content filtering, no output validation. A malicious document title or filename could influence the system prompt.

What enterprise requires

Defence-in-depth against prompt injection: input sanitisation, prompt/data boundary markers, output validation against expected schemas, and monitoring for anomalous LLM behaviour. In a legal context, a compromised agent generating incorrect contract terms or leaking privileged information across matters is a professional liability issue.

05 — Operations and integrations

How the system fits into the rest of the stack, and how you would know if it broke

The last two gaps are about whether the system can plug into the tooling legal teams already use, and whether anyone would notice when it produced incorrect output for a client matter.

Gap 12No integrations

MikeOSS is entirely self-contained. The only external services are Supabase, Cloudflare R2, and the LLM APIs. There is no integration with:

What enterprise requires

Legal teams don't adopt tools in isolation. A platform has to ingest documents from where they already live (DMS), communicate through existing channels (email, Slack), and write results back to systems of record (matter management, CLM). Without integrations, adoption means manual file upload and download for every interaction, which is a non-starter for high-volume legal operations.

Gap 13No observability

Monitoring is limited to console.log statements scattered through the codebase and the raw stream log file. No structured log formats (JSON logging with correlation IDs), no request tracing, no error tracking, no performance metrics, no alerting on failures or anomalies, and no health checks beyond a trivial GET /health → { ok: true }.

What enterprise requires

When an AI agent produces incorrect output for a client matter, the team needs to reconstruct exactly what happened: what documents were read, what the LLM was prompted with, what tools it called, what it generated, and who saw the output. Without observability, debugging is guesswork.

06 — Summary

The thirteen gaps in one view

The table below collapses the analysis as of the 20 May 2026 snapshot. The left column is the capability. The middle column is what MikeOSS currently implements. The right column is what an enterprise legal AI platform has to provide. Rows marked closed reflect findings the maintainer had already resolved by the time this page was refreshed against the 20 May snapshot.

CapabilityMikeOSSEnterprise requirement
Document retrievalFull context-stuffing via tool callsVector search, hybrid retrieval, cross-corpus
Execution modelSynchronous request/responseDurable background jobs, batch processing
Workflow enginePrompt template prepended to system promptMulti-step, conditional, approval-gated
Multi-tenancyEmail-list sharing, no org modelOrg-scoped isolation, RBAC, SSO/SCIM
Audit and complianceRaw stream log artifact removed; no audit table yetStructured, immutable audit trail
Rate limitingTiered express-rate-limit on chat / upload routes; cost governance still openPer-user, per-org, with cost attribution
Input validationNoneSchema validation at every boundary
Secret managementSigning secret now fails fast on missing envFail-fast, rotatable, scoped secrets
Data isolationService key removed from frontend tree; RLS coverage still partialRLS plus app-level, service key server-only
Auth postureBackend middleware fails closed on missing credentials / invalid tokensFail closed on missing credentials
Prompt securityDirect interpolation of user inputIsolation markers, output validation, monitoring
IntegrationsNone (self-contained)DMS, email, Slack, CLM, eSignature
Observabilityconsole.logStructured logging, tracing, alerting
A measured closing

None of this makes MikeOSS uninteresting. Will Chen has made the chat-with-documents-and-tabular-extract layer free, and that layer is a meaningful share of what Harvey and Legora actually deliver at the demo. The pressure that puts on their pricing is real. The 20 May 2026 refresh of this page also makes another point worth saying out loud. Between the first publication and this snapshot, the maintainer has closed several of the security findings: rate limiting, signing-secret fallback, frontend service-key exposure, silent auth bypass. That is the open-source teardown loop working as intended, and credit is due.

What the remaining gaps make visible is a different point. Adding features doesn't close the distance between a working chat assistant and a production agent platform. That distance gets closed through years of work on retrieval, durable orchestration, multi-tenant governance, deeper security hardening, integrations, and observability. That work is what an in-house team would have to absorb if it took MikeOSS as a starting point, and most of it is invisible at the demo. Almost none of it is on a roadmap that a two-week project can credibly own.

A note on teardowns

Open-source teardowns are a long tradition, and a productive one

Picking through an open-source codebase to surface its rough edges has a long and useful history. Early Linux kernel critiques pushed the scheduler, VM, and SMP work that hardened the kernel for production. The OpenSSL post-Heartbleed reviews drove LibreSSL, BoringSSL, and the modern audit culture around cryptographic code. Public teardowns of MongoDB's defaults, Kubernetes' security posture, Log4j's surface area, and countless framework releases (React, Rails, Django, Next.js) have repeatedly closed the gap between "interesting project" and "thing you can run a regulated business on".

The pattern is consistent. The project ships, the community pressure-tests it in public, the maintainers (or a fork) absorb the findings, and a year or two later the production-grade version exists because of that scrutiny, not in spite of it. That is the spirit of this analysis. MikeOSS is a credible piece of work, and the gaps documented here are the same gaps every early open-source project has had to grow through to become something an enterprise can trust.

Subscribe

The Intake

Weekly briefings on what's actually changing in legal AI: the market shifts, regulatory moves, and structural questions that matter for enterprise legal teams.

Subscribe on Substack
Flank

Outsource legal work to supervised agents

Enterprise legal teams use Flank to handle high-volume contracting end-to-end: NDAs, MSA redlines, procurement, triage. Agents that know your templates, terms, and escalation rules.

Learn more at flank.ai