A code-level gap analysis comparing the open source legal AI assistant against what an enterprise-grade legal AI platform actually has to do. Every claim cites a specific file, line, or architectural decision in the public MikeOSS repository.
MikeOSS is, by Will Chen's own description, two weeks of work. As a clone of the chat-with-documents surface of Harvey and Legora, it is a credible artifact and a useful pressure point on legal AI pricing. It also highlights how thin the initial "talk to your docs" use cases actually were.
The point of this document here is narrower. It's a (rather rapid) audit of where the MikeOSS codebase falls short of what an enterprise-grade legal AI platform actually has to do, with every gap citing a specific file, line, or architectural decision in the public repository. We're not arguing MikeOSS shouldn't exist. The work it does at the chat-with-documents layer is real and it provides a solid, well designed, and thoughtfully architected starting point for self-build projects. Instead, the aim here is just to make the distance between a working chat-with-documents demo and a production-grade agentic legal system legible, because that distance is mostly invisible at the demo, which is what makes it expensive when an in-house team underestimates it.
Even if you're a huge fan of MikeOSS, this is for you, as this should help better surface the gaps for you to bridge when deploying Mike (or a fork of Mike) into production!
I've organised the thirteen gaps into four areas: architecture and execution model, multi-tenancy and governance, security posture, and operations.
Any thoughts, questions, or challenges, please add me on LinkedIn.
The interesting question here is the delta: what would an in-house team need to build, harden and run on top of this codebase to get from where the repo sits today to something they would put in front of privileged matter work in a regulated organisation. The foundation models and surfaces inside MikeOSS aren't the variable; they're the same models and similar surfaces every vendor uses. The delta is. The rest of this piece walks through it.
Three findings here are about how MikeOSS executes work, and what kinds of work the underlying design forecloses. None of these is the sort of thing you fix by adding features; each would require changing the shape of the system itself.
I'm leading with these simply because they're the biggest delta. No doubt there's a fork out there already reworking some of this.
MikeOSS has no vector store, no embeddings, no RAG pipeline, and no semantic search. Zero dependencies on pgvector, Pinecone, Weaviate, Chroma, or any embedding model. Documents are loaded in full via LLM tool calls. The system prompt explicitly forces re-reading on every turn:
The only "search" is find_in_document, a case-insensitive substring match over raw text (chatTools.ts:288-320 in the 20 May snapshot). No relevance ranking, no fuzzy matching, no cross-document discovery.
A chunking and embedding pipeline (per-document and cross-corpus), vector similarity search for retrieval, hybrid search combining BM25 and dense retrieval, and citation grounding that maps back to source chunks. Without this, MikeOSS cannot handle large document sets, multi-document Q&A, or precedent search.
Every LLM call runs in the HTTP request/response cycle. There is no job queue, no worker process, no async task infrastructure. The backend is a single Express process:
No Bull, no BullMQ, no Redis, no task scheduler. When a user sends a chat message, the server blocks that request until the LLM finishes. This forecloses batch document processing (reviewing 200 contracts against a playbook), long-running agent workflows that outlive a browser session, scheduled or trigger-based agent execution, parallel multi-document analysis, and any operation that takes longer than a reasonable HTTP timeout.
A durable task queue (with retry, dead-letter, and priority), worker processes that can run independently of the API server, and a status/progress API so the frontend can poll or subscribe to job state. Flank uses Temporal for this, with workflows that survive process restarts.
The entire workflow system consists of three static prompt templates:
Workflows are strings pasted into the system prompt. There is no conditional logic, no branching, no multi-step orchestration, no loops, no human-in-the-loop approval gates, and no way to compose workflows from reusable steps. User-created workflows follow the same pattern. The "engine" is: prepend the workflow prompt to the system prompt, then run the normal chat loop.
A workflow engine with multi-step execution (step A's output feeds step B), conditional branching, parallel execution, human-in-the-loop gates, retry and error handling per step, and audit trails per execution. Legal workflows are inherently multi-step and approval-gated. A single-prompt approach cannot model them.
For a platform that handles privileged information across multiple matters, the absence of an organisation model and an audit trail isn't a minor omission. It tells you the system was designed as a single-user tool first.
The database schema has no concept of organisations, teams, or tenants. Users exist as flat rows in user_profiles:
The organisation field is free text with no constraints or relationships. No organisation tables, no team tables, no role assignments. Sharing is email-based via JSONB arrays (shared_with), and access checks match the current user's email against this array. There is no RBAC. The isOwner boolean is the only permission level.
Organisation-scoped data isolation (or per-tenant databases), role-based access (admin, member, viewer, external counsel), team-based sharing, SSO/SCIM provisioning, and audit trails. A legal platform handling privileged information across multiple clients needs strict data boundaries. Email-list sharing does not scale and cannot be audited.
There is no audit log table, no event journal, no record of who accessed what document or what the AI generated. As of the 20 May 2026 snapshot the previous RAW_STREAM_LOG_PATH debug file write has been removed (the original cite at backend/src/lib/llm/claude.ts:83-86 no longer exists in the repo), which is a clear improvement. A structured audit system has not yet been added in its place. There is no audit log table in backend/schema.sql, no event journal, and no chain-of-custody record of what documents were accessed, what prompts were sent, or what edits the AI proposed and whether they were accepted.
The raw debug log artifact has been removed. The structural finding still stands: no immutable audit trail.
Immutable, structured audit logs capturing who initiated each action, what documents were accessed, what the AI generated, what edits were proposed and whether they were accepted, and when. Legal teams operating under professional privilege obligations need to demonstrate chain-of-custody for AI-assisted work product. Regulators increasingly require explainability for AI-generated legal output.
The original finding was that no rate-limiting middleware sat in front of the LLM endpoints. In the 20 May 2026 snapshot this is addressed: express-rate-limit is wired in with tiered limiters for general traffic, chat, chat-create, and uploads, all configurable via environment variables. helmet has also been added for security headers.
What remains open is the deeper finding underneath: extended thinking is still pinned to maximum effort (thinking: { type: "adaptive" }, output_config: { effort: "high" } in backend/src/lib/llm/claude.ts), message_credits_used in user_profiles is still a counter rather than a gate enforced before the LLM call, and there is no per-organisation token budget or cost attribution. The IP-based limiter prevents pathological volume; it does not yet provide cost governance.
Route-level rate limiting and Helmet security headers are now in place. The cost-governance dimension (per-org budgets, pre-call credit enforcement, configurable effort) is not yet addressed.
Some of what follows is typical of early-stage codebases. A few of these findings would prevent a serious procurement team from clearing the platform for production use without rewrites.
No schema validation library (zod, joi, or similar) is used anywhere. Route handlers accept raw req.body and pass it through:
The shared_with array, columns_config JSONB, and messages array are all accepted without type or shape validation. A malformed payload will either crash deep in business logic or corrupt JSONB columns.
Schema validation at every API boundary. Malformed inputs to an LLM pipeline produce cascading failures that are hard to diagnose. In a multi-tenant system, input validation is also a security boundary. Untrusted input must never reach the database or LLM layer unvalidated.
The original finding was that getSecret() silently fell back to the Supabase service-role key, and then to the literal string "dev-secret", if DOWNLOAD_SIGNING_SECRET was unset. As of the 20 May 2026 snapshot the function fails closed:
The fallback chain has been removed. A missing signing secret now throws at startup with a clear remediation message, which is exactly the fail-fast behaviour the original requirement called for.
The original finding cited frontend/src/lib/supabase-server.ts using SUPABASE_SECRET_KEY (the service-role key, which bypasses RLS) inside the Next.js source tree. As of the 20 May 2026 snapshot that file is gone. The frontend now ships frontend/src/lib/auth.ts, which validates Bearer tokens against Supabase using the publishable key (NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY) only. The service-role key is now confined to the backend in backend/src/lib/supabase.ts and backend/src/middleware/auth.ts.
Frontend exposure of the service-role key has been eliminated, which was the main risk in the original finding. One secondary point still stands: Row Level Security is still only applied to user_profiles, not to projects, documents, chats, workflows, or tabular_reviews. All access control on those tables is still enforced at the application layer only, which leaves the defence-in-depth gap open.
The original finding was that the frontend auth helper, when Supabase credentials were missing, silently accepted the raw Bearer token as the user ID. That meant a misconfigured production deploy would authenticate any string as any user. As of the 20 May 2026 snapshot the responsibility has moved to backend/src/middleware/auth.ts, which fails closed on every path:
Missing credentials now produce a 500, invalid tokens produce a 401, and the "accept raw token as user ID" fallback is gone. This is the fail-closed behaviour the original requirement called for.
User messages are interpolated directly into LLM prompts without isolation. The title generation endpoint:
Document context is built by appending user-supplied filenames and content into the system prompt. There are no isolation markers, no content filtering, no output validation. A malicious document title or filename could influence the system prompt.
Defence-in-depth against prompt injection: input sanitisation, prompt/data boundary markers, output validation against expected schemas, and monitoring for anomalous LLM behaviour. In a legal context, a compromised agent generating incorrect contract terms or leaking privileged information across matters is a professional liability issue.
The last two gaps are about whether the system can plug into the tooling legal teams already use, and whether anyone would notice when it produced incorrect output for a client matter.
MikeOSS is entirely self-contained. The only external services are Supabase, Cloudflare R2, and the LLM APIs. There is no integration with:
resend is in package.json but unused (zero grep matches for actual invocation).Legal teams don't adopt tools in isolation. A platform has to ingest documents from where they already live (DMS), communicate through existing channels (email, Slack), and write results back to systems of record (matter management, CLM). Without integrations, adoption means manual file upload and download for every interaction, which is a non-starter for high-volume legal operations.
Monitoring is limited to console.log statements scattered through the codebase and the raw stream log file. No structured log formats (JSON logging with correlation IDs), no request tracing, no error tracking, no performance metrics, no alerting on failures or anomalies, and no health checks beyond a trivial GET /health → { ok: true }.
When an AI agent produces incorrect output for a client matter, the team needs to reconstruct exactly what happened: what documents were read, what the LLM was prompted with, what tools it called, what it generated, and who saw the output. Without observability, debugging is guesswork.
The table below collapses the analysis as of the 20 May 2026 snapshot. The left column is the capability. The middle column is what MikeOSS currently implements. The right column is what an enterprise legal AI platform has to provide. Rows marked closed reflect findings the maintainer had already resolved by the time this page was refreshed against the 20 May snapshot.
| Capability | MikeOSS | Enterprise requirement |
|---|---|---|
| Document retrieval | Full context-stuffing via tool calls | Vector search, hybrid retrieval, cross-corpus |
| Execution model | Synchronous request/response | Durable background jobs, batch processing |
| Workflow engine | Prompt template prepended to system prompt | Multi-step, conditional, approval-gated |
| Multi-tenancy | Email-list sharing, no org model | Org-scoped isolation, RBAC, SSO/SCIM |
| Audit and compliance | Raw stream log artifact removed; no audit table yet | Structured, immutable audit trail |
| Rate limiting | Tiered express-rate-limit on chat / upload routes; cost governance still open | Per-user, per-org, with cost attribution |
| Input validation | None | Schema validation at every boundary |
| Secret management | Signing secret now fails fast on missing env | Fail-fast, rotatable, scoped secrets |
| Data isolation | Service key removed from frontend tree; RLS coverage still partial | RLS plus app-level, service key server-only |
| Auth posture | Backend middleware fails closed on missing credentials / invalid tokens | Fail closed on missing credentials |
| Prompt security | Direct interpolation of user input | Isolation markers, output validation, monitoring |
| Integrations | None (self-contained) | DMS, email, Slack, CLM, eSignature |
| Observability | console.log | Structured logging, tracing, alerting |
None of this makes MikeOSS uninteresting. Will Chen has made the chat-with-documents-and-tabular-extract layer free, and that layer is a meaningful share of what Harvey and Legora actually deliver at the demo. The pressure that puts on their pricing is real. The 20 May 2026 refresh of this page also makes another point worth saying out loud. Between the first publication and this snapshot, the maintainer has closed several of the security findings: rate limiting, signing-secret fallback, frontend service-key exposure, silent auth bypass. That is the open-source teardown loop working as intended, and credit is due.
What the remaining gaps make visible is a different point. Adding features doesn't close the distance between a working chat assistant and a production agent platform. That distance gets closed through years of work on retrieval, durable orchestration, multi-tenant governance, deeper security hardening, integrations, and observability. That work is what an in-house team would have to absorb if it took MikeOSS as a starting point, and most of it is invisible at the demo. Almost none of it is on a roadmap that a two-week project can credibly own.
Picking through an open-source codebase to surface its rough edges has a long and useful history. Early Linux kernel critiques pushed the scheduler, VM, and SMP work that hardened the kernel for production. The OpenSSL post-Heartbleed reviews drove LibreSSL, BoringSSL, and the modern audit culture around cryptographic code. Public teardowns of MongoDB's defaults, Kubernetes' security posture, Log4j's surface area, and countless framework releases (React, Rails, Django, Next.js) have repeatedly closed the gap between "interesting project" and "thing you can run a regulated business on".
The pattern is consistent. The project ships, the community pressure-tests it in public, the maintainers (or a fork) absorb the findings, and a year or two later the production-grade version exists because of that scrutiny, not in spite of it. That is the spirit of this analysis. MikeOSS is a credible piece of work, and the gaps documented here are the same gaps every early open-source project has had to grow through to become something an enterprise can trust.