Legal AI Agents: What Works in Production, Not Pilots

By

Komy A.

July 5, 2026

10 min read

AI Agents for Law Firms: What Works in Production vs. Pilot

Why Legal Doesn't Play by the Same Rules as Other Regulated Industries

We've written before about AI agents in insurance, wealth management, and professional services. Legal is a different animal, and if you're a General Counsel or a CTO trying to figure out where AI agents fit into a law firm or legal department, you need to understand why before you touch a single vendor demo.

Finance has regulators. Healthcare has HIPAA. Legal has something neither of those industries deals with in the same way: privilege. Attorney-client privilege isn't a compliance checkbox you can satisfy with encryption and access logs. It's a legal doctrine that can be waived, permanently, by the wrong data flowing through the wrong system. Send privileged material through a third-party API without the right contractual protections, and you may have just given opposing counsel an argument that the privilege no longer applies. That's not a fine. That's the whole case.

Then there's unauthorized practice of law (UPL). Most industries can automate a decision and call it a productivity gain. In law, if an AI system is effectively giving legal advice to a non-lawyer without a licensed attorney supervising the output, you're in UPL territory in a lot of US states, and Singapore has its own version of this concern through the Legal Profession Act. Add to that the bar's ethics rules, which now explicitly extend supervisory duties to AI tools, and you get a picture of why "just deploy an agent" is a much riskier sentence in a law firm than it is in a logistics company.

This is the frame every vendor blog skips. They'll tell you what the agent does. They won't tell you what happens to your malpractice exposure when it's wrong.

What Legal AI Agents Actually Do Today

Strip away the marketing and the actual use cases are narrower and more mechanical than the pitch decks suggest. Legal research and drafting is the biggest one: tools like Harvey and Thomson Reuters CoCounsel pull case law, summarize precedent, and draft first-pass memos or briefs. Contract review is the second major category, where products like Spellbook sit inside Word and flag risky clauses against a firm's playbook. Intake and case management agents (Clio Duo is a good example) triage new matters, extract facts from client intake forms, and route work to the right team.

None of this is exotic anymore. What's underdiscussed is that these tools operate at very different points on the risk spectrum. A contract review agent flagging an indemnification clause for human review is low-stakes: a lawyer still reads the contract, so the worst case is a missed flag that ordinary review would have to catch anyway. An agent drafting language that goes into a court filing without a citation-verification step is a different category of risk entirely, and that's exactly where the well-known failures have happened.

Adoption has moved fast. The Thomson Reuters Institute has been tracking generative AI use among legal professionals, and its reported adoption figures keep climbing, which tells you legal ops teams aren't waiting for permission anymore. The question most of them haven't answered is which of three paths gets them to production: buy an off-the-shelf platform, build something custom, or bring in a partner who's shipped this kind of system before.

The Real Decision: Buy the Platform, Build Custom, or Bring in a Partner

Buying makes sense when your use case is generic and your risk tolerance matches what the vendor has already hardened. If you're a mid-size firm doing standard contract review with no unusual confidentiality requirements, a tool like Spellbook or Lexis+ AI will get you most of the value with almost none of the engineering lift. Don't build what someone else has already spent years de-risking.

Building custom becomes the right call when your workflow doesn't map cleanly onto a vendor's assumptions. A firm with a heavily customized practice management stack, a specific matter-type taxonomy, or a requirement that no client data ever touch a shared model endpoint often finds that off-the-shelf tools force compromises they can't accept. We've seen this pattern with clients across regulated industries: the moment "our process is a little different" turns into "the vendor's config screen can't do that," you're already on the path to custom work, whether you planned for it or not.

The partner path exists for the group in between, which in our experience is most legal teams. You know roughly what you need. You don't have six engineers to spend a year wiring an agent into iManage, building a citation-verification pipeline, and getting the access controls scoped correctly to matter and client. This is where a team that has actually shipped production AI systems earns its keep, not by selling you a platform, but by building the specific thing your practice needs and handing you something your own engineers can maintain afterward.

The mistake we see most often is skipping straight to "let's build our own Harvey." Almost nobody needs that. What they need is a narrower agent, wired correctly into the systems they already have, with the compliance layer built in from day one instead of bolted on after a near-miss.

Where Legal AI Pilots Break in Production

The pilot always works. Pilots are demos with curated inputs and a forgiving audience. Production is a different environment, and legal has produced some of the most public failures of any regulated vertical.

The case everyone in legal now knows is Mata v. Avianca. Two New York attorneys submitted a brief citing cases that didn't exist, generated by ChatGPT and never checked. In June 2023 the Southern District of New York sanctioned them and their firm $5,000. It wasn't a one-off embarrassment either. In Park v. Kim (2024), the Second Circuit referred an attorney to its grievance panel over a hallucinated citation in an appellate brief, meaning appellate courts are now enforcing this too, not just trial courts.

Here's the part vendors don't want you to sit with: this doesn't stop being a problem once you move to "proper" legal AI tools. A 2024 study from Stanford RegLab and Stanford HAI found that even purpose-built legal research tools, including Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI, still produce hallucinated or unsupported answers, with reported rates of roughly 17% to 33% on the study's benchmark queries depending on the tool. Buying a legal-specific product doesn't buy you out of the hallucination problem. It reduces it, but it doesn't solve it, and any vendor implying otherwise hasn't read their own back-testing data.

The other place pilots break is quieter and shows up later: privilege leakage through retrieval pipelines. A RAG system pulling from a shared document index across matters can surface content from Matter A into a response generated for Matter B if access scoping isn't built at the retrieval layer, not just the UI layer. We've seen this pattern in adjacent regulated industries where multi-tenant data boundaries got treated as a permissions problem instead of an architecture problem. In law, that mistake isn't just a data breach. It's a privilege waiver, and it can taint the underlying matter.

Integration failures round out the list. An agent that works beautifully in isolation but can't write back into iManage or NetDocuments cleanly, or that breaks Clio's matter-numbering conventions, becomes shadow IT that nobody trusts within two months.

The Compliance Layer Vendors Don't Price In

ABA Formal Opinion 512, issued in July 2024, is the document every US legal team building or buying AI agents needs to actually read, not skim. It applies the existing Model Rules to generative AI use: competence (Rule 1.1) includes understanding the tool's limitations, confidentiality (Rule 1.6) requires understanding exactly how client data flows through the system, candor to the tribunal (Rule 3.3) means you can't file AI-generated content you haven't verified, and the supervision duties in Rules 5.1 and 5.3, originally written for supervising junior associates and non-lawyer staff, now reach the use of AI tools and their output.

That last point matters more than firms realize. A partner who lets an associate file a brief without review is on the hook. The same standard now applies if that associate used an AI agent instead of a junior colleague. The tool doesn't diffuse the responsibility.

Singapore-based legal teams have their own version of this conversation. The Law Society of Singapore has put out professional guidance on lawyers' use of generative AI tools, and firms operating across US and Singapore jurisdictions need to think about data residency the moment client data starts moving through any AI system, not after. Cross-border legal ops at $10M+ companies often assume their existing SOC 2 posture covers this. It usually doesn't, because legal data has confidentiality obligations layered on top of whatever your standard data governance policy already requires.

The NIST AI Risk Management Framework gives you a usable structure here even though it wasn't written for law firms specifically: map where AI touches your matters, measure the actual error and leakage rates, manage the controls around access and human review, and govern the whole thing with clear ownership. Most firms we talk to have skipped straight to deployment without doing the "map" step at all, which is how privilege leakage ends up discovered by opposing counsel instead of by internal audit.

What a Production-Grade Legal AI Agent Actually Requires

If you're going past off-the-shelf tools into anything custom, there's a specific set of things that has to exist before go-live, not as a nice-to-have but as a condition of using the system at all.

Citation verification is non-negotiable for anything touching drafting or research. Every citation the agent produces needs to resolve against an actual, current source before a human ever sees it as "verified," not just plausible-sounding. This is table stakes after Mata v. Avianca, and any custom build that skips it is asking for its own sanctions headline.
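
As a rough illustration of what "resolve before a human sees it as verified" can look like, here is a minimal Python sketch. It assumes a deliberately simplified regex for a handful of US reporter formats and a hypothetical `lookup` callable standing in for whatever citator or research database your firm actually licenses; neither is a real vendor API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Simplified pattern for a few US reporter citations, e.g. "5 U.S. 137".
# Assumption: a production system would use a far more complete citation parser.
CITATION_RE = re.compile(
    r"\b\d{1,4}\s+(?:U\.S\.|S\. Ct\.|F\.(?:2d|3d|4th)|F\. Supp\.(?: [23]d)?)\s+\d{1,5}\b"
)


@dataclass(frozen=True)
class CitationCheck:
    citation: str
    resolved: bool
    source_id: Optional[str]


def verify_citations(
    draft: str, lookup: Callable[[str], Optional[str]]
) -> list[CitationCheck]:
    """Extract citations from a draft and try to resolve each one.

    `lookup` is a hypothetical resolver: it returns a source identifier
    when the citation exists in an authoritative database, else None.
    """
    checks = []
    for match in CITATION_RE.finditer(draft):
        cite = " ".join(match.group(0).split())  # normalize whitespace
        source_id = lookup(cite)
        checks.append(CitationCheck(cite, source_id is not None, source_id))
    return checks


def release_blocked(checks: list[CitationCheck]) -> bool:
    """A draft is blocked if any extracted citation failed to resolve."""
    return any(not c.resolved for c in checks)


if __name__ == "__main__":
    known_sources = {"5 U.S. 137": "marbury-v-madison"}  # stand-in for a citator
    draft = (
        "See Varghese v. China Southern Airlines, 925 F.3d 1339 (11th Cir. 2019); "
        "cf. Marbury v. Madison, 5 U.S. 137 (1803)."
    )
    results = verify_citations(draft, known_sources.get)
    for check in results:
        print(check)
    print("blocked:", release_blocked(results))
```

Two caveats on this sketch: a citation the regex fails to extract is never checked, so extraction coverage needs its own testing, and resolving a citation only shows the case exists, not that it supports the proposition it is cited for. That second check still belongs to the reviewing attorney.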

Audit trails need to cover the full chain: what data went in, what the model produced, what a human changed, and who signed off, timestamped and immutable. If work product from an agent ends up in a filing and gets challenged, you need to reconstruct exactly how it was generated, not reconstruct it from memory in a deposition.
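
One common way to make a log tamper-evident is to chain each entry to the hash of the previous one. The sketch below is a minimal in-memory version of that idea, with hypothetical field names (`actor`, `action`, `payload_sha256`). It detects after-the-fact edits but is not immutable storage on its own; in practice the entries would also need to land in write-once storage that the application cannot rewrite.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Stable SHA-256 over an entry's fields (keys sorted for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only, hash-chained log. Field names here are illustrative."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, actor: str, action: str, payload: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,  # e.g. "input", "model_output", "human_edit", "sign_off"
            "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
            "prev_hash": prev_hash,
        }
        body["hash"] = _digest(body)  # digest is computed before "hash" is added
        self._entries.append(body)
        return dict(body)

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS_HASH
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    log = AuditLog()
    log.append("agent", "model_output", "Draft memo v1")
    log.append("associate-17", "human_edit", "Draft memo v2")
    log.append("partner-03", "sign_off", "Draft memo v2")
    print("chain intact:", log.verify())
```

Storing a hash of the payload rather than the payload itself keeps privileged text out of the log while still letting you prove which version of a document each step touched.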

Access controls have to be scoped to matter and client at the retrieval layer, not the application layer. This is the fix for the privilege leakage problem above. If your RAG pipeline can technically retrieve across matters and you're relying on the UI to hide it, you don't have a control, you have a bug waiting to be found.
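
To make the distinction concrete, here is a minimal sketch in which the matter filter is applied inside the retrieval call itself, before any ranking happens. The `Chunk` and `MatterScopedIndex` names, the entitlements mapping, and the toy term-overlap scoring are all assumptions for illustration; a real system would push the same filter into its vector store or search engine query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    matter_id: str
    text: str


class MatterScopedIndex:
    """Toy document index where every query is bound to exactly one matter."""

    def __init__(self, chunks: list[Chunk], entitlements: dict[str, set[str]]) -> None:
        self._chunks = list(chunks)
        self._entitlements = entitlements  # user id -> matter ids they may access

    def retrieve(self, query: str, user: str, matter_id: str, k: int = 3) -> list[Chunk]:
        # 1. Entitlement check: fail closed if the user is not on the matter.
        if matter_id not in self._entitlements.get(user, set()):
            raise PermissionError(f"{user} is not entitled to matter {matter_id}")

        # 2. Scope first: chunks from other matters never enter the candidate set.
        candidates = [c for c in self._chunks if c.matter_id == matter_id]

        # 3. Rank only within the scoped set (naive term overlap as a stand-in).
        terms = set(query.lower().split())
        scored = [(len(terms & set(c.text.lower().split())), c) for c in candidates]
        scored = [(score, c) for score, c in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:k]]


if __name__ == "__main__":
    index = MatterScopedIndex(
        chunks=[
            Chunk("A-1", "A", "indemnification cap negotiated at two million"),
            Chunk("B-1", "B", "indemnification clause disputed by counterparty"),
        ],
        entitlements={"alice": {"A"}},
    )
    print(index.retrieve("indemnification clause", user="alice", matter_id="A"))
```

The point of the design is that there is no code path that ranks across matters and filters afterward, so a UI bug or prompt injection downstream cannot expose Matter B content in a Matter A session.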

Model evaluation needs to be legal-specific, not generic benchmark scores. A model that performs well on general reasoning benchmarks can still hallucinate on statute interpretation or jurisdiction-specific procedure. You need eval sets built from your own firm's actual matter types, checked against known-correct outputs, refreshed regularly.
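
A firm-specific eval set can start very small. The sketch below groups test cases by matter type and reports a pass rate for each, treating the model as any callable that maps a prompt to an answer. The `EvalCase` structure, the substring-based grading, and the example playbook facts are hypothetical; real grading would usually need attorney review or a stricter rubric than phrase matching.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    matter_type: str
    prompt: str
    required_phrases: tuple[str, ...]  # all must appear for the case to pass


def run_eval(cases: list[EvalCase], model: Callable[[str], str]) -> dict[str, float]:
    """Return the pass rate per matter type for a given model callable."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        answer = model(case.prompt).lower()
        total[case.matter_type] += 1
        if all(phrase.lower() in answer for phrase in case.required_phrases):
            passed[case.matter_type] += 1
    return {matter_type: passed[matter_type] / total[matter_type] for matter_type in total}


if __name__ == "__main__":
    # Invented playbook facts, used only to exercise the harness.
    cases = [
        EvalCase("contracts", "What is our standard indemnification cap?", ("12 months of fees",)),
        EvalCase("contracts", "What is our default governing law?", ("new york",)),
        EvalCase("employment", "What notice period does our template use?", ("30 days",)),
    ]
    canned = {
        "What is our standard indemnification cap?": "The cap is 12 months of fees.",
        "What is our default governing law?": "Delaware law governs by default.",
        "What notice period does our template use?": "The template uses 30 days.",
    }
    print(run_eval(cases, lambda prompt: canned.get(prompt, "")))
```

Breaking results out by matter type matters because an aggregate score can hide a practice area where the model is consistently wrong.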

And human-in-the-loop sign-off has to be a real gate, not a formality. That means the workflow cannot release anything client-facing or court-facing until a licensed attorney has reviewed it, and the system logs who reviewed it and when.
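
A gate like this can be sketched as a release function that refuses to run without a recorded approval. In the minimal version below, `ReviewGate`, `WorkProduct`, and the set of licensed attorney IDs are hypothetical names; the approval is tied to a hash of the document body so that any edit after sign-off forces a fresh review.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class WorkProduct:
    doc_id: str
    body: str
    approved_by: Optional[str] = None
    approved_hash: Optional[str] = None  # hash of the body at approval time


class ReviewGate:
    """Blocks release until a licensed attorney has approved the exact text."""

    def __init__(self, licensed_attorneys: set[str]) -> None:
        self._licensed = set(licensed_attorneys)
        self.log: list[dict] = []

    def _record(self, event: str, item: WorkProduct, actor: str) -> None:
        self.log.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "doc_id": item.doc_id,
            "actor": actor,
        })

    def approve(self, item: WorkProduct, reviewer_id: str) -> None:
        if reviewer_id not in self._licensed:
            self._record("approval_rejected", item, reviewer_id)
            raise PermissionError(f"{reviewer_id} is not a licensed attorney on record")
        item.approved_by = reviewer_id
        item.approved_hash = _sha256(item.body)
        self._record("approved", item, reviewer_id)

    def release(self, item: WorkProduct) -> str:
        if item.approved_by is None:
            raise RuntimeError("no attorney approval on record")
        if item.approved_hash != _sha256(item.body):
            raise RuntimeError("body changed after approval; re-review required")
        self._record("released", item, item.approved_by)
        return item.body


if __name__ == "__main__":
    gate = ReviewGate(licensed_attorneys={"partner-03"})
    memo = WorkProduct("memo-1", "Draft advice to client.")
    gate.approve(memo, "partner-03")
    print(gate.release(memo))
    print(gate.log)
```

The important property is that the only way to get text out of the system is through `release`, so the review cannot be skipped by convention or forgotten under deadline pressure.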

A Practical Vendor and Partner Evaluation Checklist

Whether you're buying a platform or contracting a partner to build something custom, the same set of questions separates the teams who've done this before from the ones selling a demo.

Where does client data physically reside, and does the answer change for US versus Singapore or other cross-border clients?
What's the model provenance, and can you get a straight answer about which base model is doing the actual reasoning, versus a wrapper around someone else's API?
What hallucination testing has been done on legal-specific queries, with real numbers, not marketing claims of "near-zero"?
Do they hold SOC 2 Type II, and can they explain, specifically, how it covers AI data flows and not just general infrastructure?
What does indemnification look like if the tool produces a hallucinated citation that makes it into a filing?
How is access scoped between matters at the data layer, and can they show you the architecture, not just describe it?
What happens to your data and your workflows if you want to leave? Vendor lock-in is a bigger problem in legal than most buyers expect, because matter history and document context are expensive to migrate.

Red flags worth walking away from: anyone who can't name a specific hallucination rate for legal queries, anyone unwilling to put indemnification in writing, and anyone who treats "we use GPT-4 under the hood" as a complete answer to a confidentiality question.

Key Takeaways

Legal AI agents work today, at real firms, on real matters. The failures that make headlines aren't proof the technology doesn't work; they're proof that firms deployed it without the compliance architecture legal work actually demands. Privilege, UPL exposure, and bar supervision duties mean the bar for "production-ready" is higher here than in most industries we build for.

Buying off-the-shelf covers a lot of ground for standard use cases. Custom builds make sense when your practice management stack, your data controls, or your risk profile don't fit a vendor's assumptions. And a lot of legal teams land in the middle, needing a partner who's actually built citation verification, scoped access controls, and audit trails before, rather than learning those lessons on your matter.

If you're working through this decision for your firm or legal department and want to compare notes with a team that has shipped production AI agents with real compliance requirements attached, get in touch with Genta AI Solutions.

We’re Here to Help

Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.

Let's Connect


By

Komy A.

July 5, 2026

10 min read

AI Agents for Law Firms: What Works in Production vs. Pilot

Why Legal Doesn't Play by the Same Rules as Other Regulated Industries

We've written before about AI agents in insurance, wealth management, and professional services. Legal is a different animal, and if you're a General Counsel or a CTO trying to figure out where AI agents fit into a law firm or legal department, you need to understand why before you touch a single vendor demo.

Finance has regulators. Healthcare has HIPAA. Legal has something neither of those industries deals with in the same way: privilege. Attorney-client privilege isn't a compliance checkbox you can satisfy with encryption and access logs. It's a legal doctrine that can be waived, permanently, by the wrong data flowing through the wrong system. Send privileged material through a third-party API without the right contractual protections, and you may have just given opposing counsel an argument that the privilege no longer applies. That's not a fine. That's the whole case.

Then there's unauthorized practice of law (UPL). Most industries can automate a decision and call it a productivity gain. In law, if an AI system is effectively giving legal advice to a non-lawyer without a licensed attorney supervising the output, you're in UPL territory in a lot of US states, and Singapore has its own version of this concern through the Legal Profession Act. Add to that the bar's ethics rules, which now explicitly extend supervisory duties to AI tools, and you get a picture of why "just deploy an agent" is a much riskier sentence in a law firm than it is in a logistics company.

This is the frame every vendor blog skips. They'll tell you what the agent does. They won't tell you what happens to your malpractice exposure when it's wrong.

What Legal AI Agents Actually Do Today

Strip away the marketing and the actual use cases are narrower and more mechanical than the pitch decks suggest. Legal research and drafting is the biggest one: tools like Harvey and Thomson Reuters CoCounsel pull case law, summarize precedent, and draft first-pass memos or briefs. Contract review is the second major category, where products like Spellbook sit inside Word and flag risky clauses against a firm's playbook. Intake and case management agents (Clio Duo is a good example) triage new matters, extract facts from client intake forms, and route work to the right team.

None of this is exotic anymore. What's underdiscussed is that these tools operate at very different points on the risk spectrum. A contract review agent flagging an indemnification clause for human review is low-stakes: worst case, a lawyer catches an oversight before it matters. An agent drafting language that goes into a court filing without a citation-verification step is a different category of risk entirely, and that's exactly where the well-known failures have happened.

The market has consolidated fast. Thomson Reuters Institute has been tracking adoption trends and the number is climbing every quarter, which tells you legal ops teams aren't waiting for permission anymore. The question most of them haven't answered is which of these three paths gets them there: buy an off-the-shelf platform, build something custom, or bring in a partner who's shipped this kind of system before.

The Real Decision: Buy the Platform, Build Custom, or Bring in a Partner

Buying makes sense when your use case is generic and your risk tolerance matches what the vendor has already hardened. If you're a mid-size firm doing standard contract review with no unusual confidentiality requirements, a tool like Spellbook or Lexis+AI will get you 80% of the value with none of the engineering lift. Don't build what someone else has already spent three years de-risking.

Building custom becomes the right call when your workflow doesn't map cleanly onto a vendor's assumptions. A firm with a heavily customized practice management stack, a specific matter-type taxonomy, or a requirement that no client data ever touch a shared model endpoint often finds that off-the-shelf tools force compromises they can't accept. We've seen this pattern with clients across regulated industries: the moment "our process is a little different" turns into "the vendor's config screen can't do that," you're already on the path to custom work, whether you planned for it or not.

The partner path exists for the group in between, which in our experience is most legal teams. You know roughly what you need. You don't have six engineers to spend a year wiring an agent into iManage, building a citation-verification pipeline, and getting the access controls scoped correctly to matter and client. This is where a team that has actually shipped production AI systems earns its keep, not by selling you a platform, but by building the specific thing your practice needs and handing you something your own engineers can maintain afterward.

The mistake we see most often is skipping straight to "let's build our own Harvey." Almost nobody needs that. What they need is a narrower agent, wired correctly into the systems they already have, with the compliance layer built in from day one instead of bolted on after a near-miss.

Where Legal AI Pilots Break in Production

The pilot always works. Pilots are demos with curated inputs and a forgiving audience. Production is a different environment, and legal has produced some of the most public failures of any regulated vertical.

The case everyone in legal now knows is Mata v. Avianca. Two New York attorneys submitted a brief citing cases that didn't exist, generated by ChatGPT and never checked. They were sanctioned $5,000 by the court in 2023. It wasn't a one-off embarrassment either. In Park v. Kim, the Second Circuit referred an attorney for disciplinary review over hallucinated citations in an appellate brief, meaning appellate courts are now enforcing this too, not just trial courts.

Here's the part vendors don't want you to sit with: this doesn't stop being a problem once you move to "proper" legal AI tools. Stanford RegLab's research found that even purpose-built legal research tools, including Lexis+AI, Westlaw AI-Assisted Research, and Ask Practical Law AI, still produce a meaningful share of hallucinated or unsupported answers. Buying a legal-specific product doesn't buy you out of the hallucination problem. It reduces it, but it doesn't solve it, and any vendor implying otherwise hasn't read their own back-testing data.

The other place pilots break is quieter and shows up later: privilege leakage through retrieval pipelines. A RAG system pulling from a shared document index across matters can surface content from Matter A into a response generated for Matter B if access scoping isn't built at the retrieval layer, not just the UI layer. We've seen this pattern in adjacent regulated industries where multi-tenant data boundaries got treated as a permissions problem instead of an architecture problem. In law, that mistake isn't just a data breach. It's a privilege waiver, and it can taint the underlying matter.

Integration failures round out the list. An agent that works beautifully in isolation but can't write back into iManage or NetDocuments cleanly, or that breaks Clio's matter-numbering conventions, becomes shadow IT that nobody trusts within two months.

The Compliance Layer Vendors Don't Price In

ABA Formal Opinion 512, issued in July 2024, is the document every US legal team building or buying AI agents needs to actually read, not skim. It extends existing Model Rules to generative AI use: competence (Rule 1.1) now includes understanding the tool's limitations, confidentiality (1.6) requires understanding exactly how client data flows through the system, candor to the tribunal (3.3) means you can't file AI-generated content you haven't verified, and the supervision duties in 5.1 and 5.3, originally written for supervising junior associates and non-lawyer staff, now apply to supervising AI output.

That last point matters more than firms realize. A partner who lets an associate file a brief without review is on the hook. The same standard now applies if that associate used an AI agent instead of a junior colleague. The tool doesn't diffuse the responsibility.

Singapore-based legal teams have their own version of this conversation. The Law Society of Singapore has put out professional guidance on lawyers' use of generative AI tools, and firms operating across US and Singapore jurisdictions need to think about data residency the moment client data starts moving through any AI system, not after. Cross-border legal ops at $10M+ companies often assume their existing SOC 2 posture covers this. It usually doesn't, because legal data has confidentiality obligations layered on top of whatever your standard data governance policy already requires.

The NIST AI Risk Management Framework gives you a usable structure here even though it wasn't written for law firms specifically: map where AI touches your matters, measure the actual error and leakage rates, manage the controls around access and human review, and govern the whole thing with clear ownership. Most firms we talk to have skipped straight to deployment without doing the "map" step at all, which is how privilege leakage ends up discovered by opposing counsel instead of by internal audit.

What a Production-Grade Legal AI Agent Actually Requires

If you're going past off-the-shelf tools into anything custom, there's a specific set of things that has to exist before go-live, not as a nice-to-have but as a condition of using the system at all.

Citation verification is non-negotiable for anything touching drafting or research. Every citation the agent produces needs to resolve against an actual, current source before a human ever sees it as "verified," not just plausible-sounding. This is table stakes after Mata v. Avianca, and any custom build that skips it is asking for its own sanctions headline.

Audit trails need to cover the full chain: what data went in, what the model produced, what a human changed, and who signed off, timestamped and immutable. If work product from an agent ends up in a filing and gets challenged, you need to reconstruct exactly how it was generated, not reconstruct it from memory in a deposition.

Access controls have to be scoped to matter and client at the retrieval layer, not the application layer. This is the fix for the privilege leakage problem above. If your RAG pipeline can technically retrieve across matters and you're relying on the UI to hide it, you don't have a control, you have a bug waiting to be found.

Model evaluation needs to be legal-specific, not generic benchmark scores. A model that performs well on general reasoning benchmarks can still hallucinate on statute interpretation or jurisdiction-specific procedure. You need eval sets built from your own firm's actual matter types, checked against known-correct outputs, refreshed regularly.

And human-in-the-loop sign-off has to be a real gate, not a formality. That means the workflow physically requires a licensed attorney's review before anything customer-facing or court-facing ships, with the system logging that the review happened.

A Practical Vendor and Partner Evaluation Checklist

Whether you're buying a platform or contracting a partner to build something custom, the same set of questions separates the teams who've done this before from the ones selling a demo.

Where does client data physically reside, and does the answer change for US versus Singapore or other cross-border clients?
What's the model provenance, and can you get a straight answer about which base model is doing the actual reasoning, versus a wrapper around someone else's API?
What hallucination testing has been done on legal-specific queries, with real numbers, not marketing claims of "near-zero"?
Do they hold SOC 2 Type II, and can they explain, specifically, how it covers AI data flows and not just general infrastructure?
What does indemnification look like if the tool produces a hallucinated citation that makes it into a filing?
How is access scoped between matters at the data layer, and can they show you the architecture, not just describe it?
What happens to your data and your workflows if you want to leave? Vendor lock-in is a bigger problem in legal than most buyers expect, because matter history and document context are expensive to migrate.

Red flags worth walking away from: anyone who can't name a specific hallucination rate for legal queries, anyone unwilling to put indemnification in writing, and anyone who treats "we use GPT-4 under the hood" as a complete answer to a confidentiality question.

Key Takeaways

Legal AI agents work today, at real firms, on real matters. The failures that make headlines aren't proof the technology doesn't work, they're proof that firms deployed it without the compliance architecture legal work actually demands. Privilege, UPL exposure, and bar supervision duties mean the bar for "production-ready" is higher here than in most industries we build for.

Buying off-the-shelf covers a lot of ground for standard use cases. Custom builds make sense when your practice management stack, your data controls, or your risk profile don't fit a vendor's assumptions. And a lot of legal teams land in the middle, needing a partner who's actually built citation verification, scoped access controls, and audit trails before, rather than learning those lessons on your matter.

If you're working through this decision for your firm or legal department and want to compare notes with a team that has shipped production AI agents with real compliance requirements attached, get in touch with Genta AI Solutions.

View all

We’re Here to Help

Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.

Let's Connect

We’re Here to Help

Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.

Let's Connect

We’re Here to Help

Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.

Let's Connect

By

Komy A.

July 5, 2026

10 min read

AI Agents for Law Firms: What Works in Production vs. Pilot

Why Legal Doesn't Play by the Same Rules as Other Regulated Industries

We've written before about AI agents in insurance, wealth management, and professional services. Legal is a different animal, and if you're a General Counsel or a CTO trying to figure out where AI agents fit into a law firm or legal department, you need to understand why before you touch a single vendor demo.

Finance has regulators. Healthcare has HIPAA. Legal has something neither of those industries deals with in the same way: privilege. Attorney-client privilege isn't a compliance checkbox you can satisfy with encryption and access logs. It's a legal doctrine that can be waived, permanently, by the wrong data flowing through the wrong system. Send privileged material through a third-party API without the right contractual protections, and you may have just given opposing counsel an argument that the privilege no longer applies. That's not a fine. That's the whole case.

Then there's unauthorized practice of law (UPL). Most industries can automate a decision and call it a productivity gain. In law, if an AI system is effectively giving legal advice to a non-lawyer without a licensed attorney supervising the output, you're in UPL territory in a lot of US states, and Singapore has its own version of this concern through the Legal Profession Act. Add to that the bar's ethics rules, which now explicitly extend supervisory duties to AI tools, and you get a picture of why "just deploy an agent" is a much riskier sentence in a law firm than it is in a logistics company.

This is the frame every vendor blog skips. They'll tell you what the agent does. They won't tell you what happens to your malpractice exposure when it's wrong.

What Legal AI Agents Actually Do Today

Strip away the marketing and the actual use cases are narrower and more mechanical than the pitch decks suggest. Legal research and drafting is the biggest one: tools like Harvey and Thomson Reuters CoCounsel pull case law, summarize precedent, and draft first-pass memos or briefs. Contract review is the second major category, where products like Spellbook sit inside Word and flag risky clauses against a firm's playbook. Intake and case management agents (Clio Duo is a good example) triage new matters, extract facts from client intake forms, and route work to the right team.

None of this is exotic anymore. What's underdiscussed is that these tools operate at very different points on the risk spectrum. A contract review agent flagging an indemnification clause for human review is low-stakes: worst case, a lawyer catches an oversight before it matters. An agent drafting language that goes into a court filing without a citation-verification step is a different category of risk entirely, and that's exactly where the well-known failures have happened.

The market has consolidated fast. Thomson Reuters Institute has been tracking adoption trends and the number is climbing every quarter, which tells you legal ops teams aren't waiting for permission anymore. The question most of them haven't answered is which of these three paths gets them there: buy an off-the-shelf platform, build something custom, or bring in a partner who's shipped this kind of system before.

The Real Decision: Buy the Platform, Build Custom, or Bring in a Partner

Buying makes sense when your use case is generic and your risk tolerance matches what the vendor has already hardened. If you're a mid-size firm doing standard contract review with no unusual confidentiality requirements, a tool like Spellbook or Lexis+ AI will get you 80% of the value with none of the engineering lift. Don't build what someone else has already spent three years de-risking.

Building custom becomes the right call when your workflow doesn't map cleanly onto a vendor's assumptions. A firm with a heavily customized practice management stack, a specific matter-type taxonomy, or a requirement that no client data ever touch a shared model endpoint often finds that off-the-shelf tools force compromises they can't accept. We've seen this pattern with clients across regulated industries: the moment "our process is a little different" turns into "the vendor's config screen can't do that," you're already on the path to custom work, whether you planned for it or not.

The partner path exists for the group in between, which in our experience is most legal teams. You know roughly what you need. You don't have six engineers to spend a year wiring an agent into iManage, building a citation-verification pipeline, and getting the access controls scoped correctly to matter and client. This is where a team that has actually shipped production AI systems earns its keep, not by selling you a platform, but by building the specific thing your practice needs and handing you something your own engineers can maintain afterward.

The mistake we see most often is skipping straight to "let's build our own Harvey." Almost nobody needs that. What they need is a narrower agent, wired correctly into the systems they already have, with the compliance layer built in from day one instead of bolted on after a near-miss.

Where Legal AI Pilots Break in Production

The pilot always works. Pilots are demos with curated inputs and a forgiving audience. Production is a different environment, and legal has produced some of the most public failures of any regulated vertical.

The case everyone in legal now knows is Mata v. Avianca. Two New York attorneys submitted a brief citing cases that didn't exist, generated by ChatGPT and never checked. They were sanctioned $5,000 by the court in 2023. It wasn't a one-off embarrassment either. In Park v. Kim, the Second Circuit referred an attorney for disciplinary review over hallucinated citations in an appellate brief, meaning appellate courts are now enforcing this too, not just trial courts.

Here's the part vendors don't want you to sit with: this doesn't stop being a problem once you move to "proper" legal AI tools. Stanford RegLab's research found that even purpose-built legal research tools, including Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI, still produce a meaningful share of hallucinated or unsupported answers. Buying a legal-specific product doesn't buy you out of the hallucination problem. It reduces it, but it doesn't solve it, and any vendor implying otherwise hasn't read its own back-testing data.

The other place pilots break is quieter and shows up later: privilege leakage through retrieval pipelines. A RAG system pulling from a shared document index across matters can surface content from Matter A into a response generated for Matter B if access scoping isn't built at the retrieval layer, not just the UI layer. We've seen this pattern in adjacent regulated industries where multi-tenant data boundaries got treated as a permissions problem instead of an architecture problem. In law, that mistake isn't just a data breach. It's a privilege waiver, and it can taint the underlying matter.

Integration failures round out the list. An agent that works beautifully in isolation but can't write back into iManage or NetDocuments cleanly, or that breaks Clio's matter-numbering conventions, becomes shadow IT that nobody trusts within two months.

The Compliance Layer Vendors Don't Price In

ABA Formal Opinion 512, issued in July 2024, is the document every US legal team building or buying AI agents needs to actually read, not skim. It extends existing Model Rules to generative AI use: competence (Rule 1.1) now includes understanding the tool's limitations, confidentiality (1.6) requires understanding exactly how client data flows through the system, candor to the tribunal (3.3) means you can't file AI-generated content you haven't verified, and the supervision duties in 5.1 and 5.3, originally written for supervising junior associates and non-lawyer staff, now apply to supervising AI output.

That last point matters more than firms realize. A partner who lets an associate file a brief without review is on the hook. The same standard now applies if that associate used an AI agent instead of a junior colleague. The tool doesn't diffuse the responsibility.

Singapore-based legal teams have their own version of this conversation. The Law Society of Singapore has put out professional guidance on lawyers' use of generative AI tools, and firms operating across US and Singapore jurisdictions need to think about data residency the moment client data starts moving through any AI system, not after. Cross-border legal ops at $10M+ companies often assume their existing SOC 2 posture covers this. It usually doesn't, because legal data has confidentiality obligations layered on top of whatever your standard data governance policy already requires.

The NIST AI Risk Management Framework gives you a usable structure here even though it wasn't written for law firms specifically: map where AI touches your matters, measure the actual error and leakage rates, manage the controls around access and human review, and govern the whole thing with clear ownership. Most firms we talk to have skipped straight to deployment without doing the "map" step at all, which is how privilege leakage ends up discovered by opposing counsel instead of by internal audit.

What a Production-Grade Legal AI Agent Actually Requires

If you're going past off-the-shelf tools into anything custom, there's a specific set of things that has to exist before go-live, not as a nice-to-have but as a condition of using the system at all.

Citation verification is non-negotiable for anything touching drafting or research. Every citation the agent produces needs to resolve against an actual, current source before a human ever sees it as "verified," not just plausible-sounding. This is table stakes after Mata v. Avianca, and any custom build that skips it is asking for its own sanctions headline.
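
As a rough sketch of what that gate can look like in Python: the draft is marked verified only when every extracted citation resolves. The `resolve` callable is a hypothetical stand-in for a lookup against whatever authoritative source the firm licenses, the regex covers only a few federal reporter formats for illustration, and the citations in the example are made-up placeholders.

```python
import re
from typing import Callable, Dict, List

# Illustrative pattern only: matches cites shaped like "111 F.3d 222" or "333 U.S. 444".
# A production system would use a real citation parser, not this regex.
CITATION_RE = re.compile(
    r"\b\d{1,4}\s+(?:U\.S\.|F\.(?:2d|3d|4th)|F\.\s?Supp\.(?:\s?[23]d)?)\s+\d{1,5}\b"
)


def extract_citations(text: str) -> List[str]:
    """Return the unique reporter citations found in the draft, in order."""
    found: List[str] = []
    for match in CITATION_RE.finditer(text):
        cite = match.group(0)
        if cite not in found:
            found.append(cite)
    return found


def verify_draft(text: str, resolve: Callable[[str], bool]) -> Dict[str, object]:
    """Mark a draft verified only if every citation resolves.

    `resolve` is an assumption: it should return True only when the citation
    exists in an authoritative, current source.
    """
    citations = extract_citations(text)
    unresolved = [c for c in citations if not resolve(c)]
    return {
        "citations": citations,
        "unresolved": unresolved,
        "verified": not unresolved,
    }


# Placeholder data: KNOWN stands in for the authoritative source.
KNOWN = {"111 F.3d 222"}
draft = "See 111 F.3d 222; but compare 333 U.S. 444."
report = verify_draft(draft, lambda cite: cite in KNOWN)
```

The design choice that matters is the default: an unresolved citation blocks the "verified" label instead of producing a warning someone can click past. Resolving a citation only proves the source exists, so a lawyer still has to confirm it says what the draft claims.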

Audit trails need to cover the full chain: what data went in, what the model produced, what a human changed, and who signed off, timestamped and immutable. If work product from an agent ends up in a filing and gets challenged, you need to be able to reconstruct exactly how it was generated from the record, not from memory in a deposition.
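
One common way to make a trail tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below is a minimal in-memory illustration under our own assumptions: the `AuditLog` class and its field names are hypothetical, and true immutability would still need write-once storage underneath.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

GENESIS = "0" * 64  # prev_hash used for the first entry


def _digest(body: Dict[str, str]) -> str:
    """Hash an entry's fields in a stable key order."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    entries: List[Dict[str, str]] = field(default_factory=list)

    def append(self, actor: str, action: str, payload: str) -> Dict[str, str]:
        """Record who did what; store a hash of the payload, not the payload itself."""
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
            "prev_hash": self.entries[-1]["hash"] if self.entries else GENESIS,
        }
        entry = dict(body, hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.append("agent", "model_output", "first-pass draft text")
log.append("associate-17", "human_edit", "revised draft text")
log.append("partner-03", "sign_off", "revised draft text")
```

Storing payload hashes instead of payloads keeps privileged text out of the log itself while still letting you prove which version of a document each step touched.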

Access controls have to be scoped to matter and client at the retrieval layer, not the application layer. This is the fix for the privilege leakage problem above. If your RAG pipeline can technically retrieve across matters and you're relying on the UI to hide it, you don't have a control, you have a bug waiting to be found.
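
In code terms, the difference is where the filter runs. The toy retriever below applies the matter filter before any ranking happens, so out-of-scope chunks never become candidates. Everything here is a hypothetical illustration: the `Chunk` type, the keyword-overlap `score`, and the matter IDs are ours, and a real pipeline would push the same filter into the vector store query.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    matter_id: str
    text: str


def score(query: str, chunk: Chunk) -> int:
    """Toy relevance score: count of query words that appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.text.lower().split()))


def retrieve(query: str, index: List[Chunk], allowed_matters: Set[str], k: int = 3) -> List[Chunk]:
    """Filter by matter first, then rank. Out-of-scope chunks are never scored."""
    candidates = [c for c in index if c.matter_id in allowed_matters]
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return [c for c in ranked[:k] if score(query, c) > 0]


# Placeholder index spanning two matters.
INDEX = [
    Chunk("d1", "matter-A", "indemnification cap is mutual"),
    Chunk("d2", "matter-B", "indemnification is uncapped"),
    Chunk("d3", "matter-A", "governing law clause"),
]

hits = retrieve("indemnification cap", INDEX, allowed_matters={"matter-A"})
```

Here `hits` contains only `d1`, even though `d2` from the other matter also mentions indemnification. An empty `allowed_matters` set returns nothing, which is the failure mode you want.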

Model evaluation needs to be legal-specific, not generic benchmark scores. A model that performs well on general reasoning benchmarks can still hallucinate on statute interpretation or jurisdiction-specific procedure. You need eval sets built from your own firm's actual matter types, checked against known-correct outputs, refreshed regularly.
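
A minimal sketch of such a harness, assuming a hypothetical `model` callable and an answer key written by your own attorneys. The substring grader is deliberately crude, and the prompts and answers are placeholders, not real legal content.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    matter_type: str
    prompt: str
    expected: str  # attorney-approved key phrase the answer must contain


def run_eval(cases: List[EvalCase], model: Callable[[str], str]) -> Dict[str, float]:
    """Return accuracy per matter type so weak areas are not averaged away."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.matter_type] += 1
        answer = model(case.prompt)
        if case.expected.lower() in answer.lower():
            correct[case.matter_type] += 1
    return {matter: correct[matter] / totals[matter] for matter in totals}


# Placeholder cases and a stub model, purely to show the shape of the harness.
CASES = [
    EvalCase("employment", "prompt-1", "answer-a"),
    EvalCase("employment", "prompt-2", "answer-b"),
    EvalCase("commercial-lease", "prompt-3", "answer-c"),
]
STUB_ANSWERS = {
    "prompt-1": "The result is Answer-A.",
    "prompt-2": "Not sure.",
    "prompt-3": "answer-c applies",
}

accuracy = run_eval(CASES, lambda prompt: STUB_ANSWERS[prompt])
```

Reporting by matter type is the point: a respectable overall score can hide a practice area where the model is wrong half the time.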

And human-in-the-loop sign-off has to be a real gate, not a formality. That means the workflow cannot proceed without a licensed attorney's review before anything client-facing or court-facing ships, with the system logging that the review happened.
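
A sketch of what "a real gate" means in code, with hypothetical names throughout (`WorkProduct`, `sign_off`, `release`, and the roster of attorney IDs are all ours). Release fails unless a rostered attorney signed off on the exact text being released, so an edit after review invalidates the sign-off.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Set


class ReviewRequired(Exception):
    """Raised when work product is released without a valid attorney sign-off."""


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class WorkProduct:
    doc_id: str
    text: str
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[str] = None
    reviewed_sha256: Optional[str] = None


def sign_off(wp: WorkProduct, attorney_id: str, licensed: Set[str]) -> None:
    """Record a review, but only for someone on the licensed-attorney roster."""
    if attorney_id not in licensed:
        raise PermissionError(f"{attorney_id} is not on the licensed-attorney roster")
    wp.reviewed_by = attorney_id
    wp.reviewed_at = datetime.now(timezone.utc).isoformat()
    wp.reviewed_sha256 = _sha(wp.text)  # ties the sign-off to this exact text


def release(wp: WorkProduct) -> str:
    """Refuse to ship anything unreviewed or changed since review."""
    if wp.reviewed_by is None or wp.reviewed_sha256 != _sha(wp.text):
        raise ReviewRequired(f"{wp.doc_id} needs attorney review before release")
    return f"{wp.doc_id} released, reviewed by {wp.reviewed_by}"


ROSTER = {"atty-001"}  # placeholder roster
memo = WorkProduct("memo-42", "first-pass draft")
sign_off(memo, "atty-001", ROSTER)
status = release(memo)
```

The gate lives in `release`, not in the UI, so there is no path to shipping that skips it. In a real system the sign-off would also be written to the audit trail.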

A Practical Vendor and Partner Evaluation Checklist

Whether you're buying a platform or contracting a partner to build something custom, the same set of questions separates the teams who've done this before from the ones selling a demo.

Where does client data physically reside, and does the answer change for US versus Singapore or other cross-border clients?
What's the model provenance, and can you get a straight answer about which base model is doing the actual reasoning, versus a wrapper around someone else's API?
What hallucination testing has been done on legal-specific queries, with real numbers, not marketing claims of "near-zero"?
Do they hold SOC 2 Type II, and can they explain, specifically, how it covers AI data flows and not just general infrastructure?
What does indemnification look like if the tool produces a hallucinated citation that makes it into a filing?
How is access scoped between matters at the data layer, and can they show you the architecture, not just describe it?
What happens to your data and your workflows if you want to leave? Vendor lock-in is a bigger problem in legal than most buyers expect, because matter history and document context are expensive to migrate.

Red flags worth walking away from: anyone who can't name a specific hallucination rate for legal queries, anyone unwilling to put indemnification in writing, and anyone who treats "we use GPT-4 under the hood" as a complete answer to a confidentiality question.

Key Takeaways

Legal AI agents work today, at real firms, on real matters. The failures that make headlines aren't proof the technology doesn't work. They're proof that firms deployed it without the compliance architecture legal work actually demands. Privilege, UPL exposure, and bar supervision duties mean the bar for "production-ready" is higher here than in most industries we build for.

Buying off-the-shelf covers a lot of ground for standard use cases. Custom builds make sense when your practice management stack, your data controls, or your risk profile don't fit a vendor's assumptions. And a lot of legal teams land in the middle, needing a partner who's actually built citation verification, scoped access controls, and audit trails before, rather than learning those lessons on your matter.

If you're working through this decision for your firm or legal department and want to compare notes with a team that has shipped production AI agents with real compliance requirements attached, get in touch with Genta AI Solutions.

We’re Here to Help

Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.

Let's Connect
