By
April 19, 2026
9 min read
Vibe Coding in Production: What Actually Breaks (And What Doesn't)



The Gap Between "It Works" and "It Ships"
Vibe coding, meaning describing what you want in plain language and letting an AI write the code, has gone from Andrej Karpathy's February 2025 tweet to a full-blown movement in little more than a year. Search volume for the term has more than doubled since early 2025. Entire subreddits, YouTube channels, and cohort courses have been built around it.
Some of this enthusiasm is real and earned. AI coding tools genuinely accelerate certain kinds of work. But there's a recurring problem: most people writing about vibe coding are writing about building things, not operating them. The demos show green checkmarks. They rarely show the 3am incident where the AI-generated auth middleware silently allowed unauthenticated requests to a payment endpoint.
This post is about the gap between "the thing runs" and "the thing runs reliably in production, under real users, in a regulated environment, for the next eighteen months." Those are different problems and they require different thinking.
What Vibe Coding Actually Does Well
Before the honest critique: the wins are real.
For scaffolding, prototyping, and internal tooling, AI-assisted coding is genuinely faster than typing everything by hand. CRUD endpoints, boilerplate configuration, test fixtures, migration scripts — these are tasks where the AI produces output that is correct more often than not, and where the cost of a mistake is low because a human reviews it before it matters.
It's also legitimately useful for exploration. If you're learning a new framework, or you need to understand how a library handles authentication before you write your own implementation, using an LLM to generate working examples and then reading them is often faster than reading docs alone. The code becomes a kind of executable explanation.
And for solo founders or small product teams building MVPs, vibe coding can compress weeks of work into days. That's a real advantage for a category of problem where iteration speed matters more than correctness guarantees.
The issue is not the tool. The issue is the failure to recognize when the tool's strengths stop being the right fit for the problem in front of you.
Where Things Break: The Honest List
Security is the first casualty
LLMs are trained to produce code that works. "Works" in a training context usually means passes tests, produces expected output, and compiles. It does not mean resists injection attacks, correctly validates JWT claims, or handles OAuth edge cases that security researchers have been documenting for years.
The OWASP Top 10 reads like a list of the exact things AI-generated code tends to get wrong: broken access control, cryptographic failures, injection vulnerabilities. Not because the AI doesn't "know" these concepts — it can explain them in detail when asked — but because generating a quick working implementation defaults to the happy path. Secure implementations require explicit, deliberate prompting at every step, plus a human who knows what to look for reviewing the output.
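To make the happy-path problem concrete, here is a minimal sketch of an injection vulnerability using Python's built-in sqlite3 module. The table, the function names, and the malicious input are all invented for illustration; the point is only that both lookups behave identically for ordinary input, which is why a quick "does it work" check does not tell them apart.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two example users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "member")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Happy-path version: the input is pasted into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized version: the driver passes the input as data,
    # never as part of the SQL statement.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
malicious = "nobody' OR '1'='1"

# Both versions agree on ordinary input.
print(find_user_unsafe(conn, "alice"), find_user_safe(conn, "alice"))

# They disagree on crafted input: the unsafe lookup returns every row.
print(len(find_user_unsafe(conn, malicious)), len(find_user_safe(conn, malicious)))
```

A reviewer who only runs the first print statement sees two correct answers. The difference shows up only when someone deliberately tests hostile input.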
Vibe coding a payments flow without a security-aware engineer in the loop is a specific kind of gamble. You might win. Production systems have shipped with worse code and survived. But the exposure is real, and most people writing "I vibe coded my whole startup" posts never disclose it.
State management collapses under complexity
Simple UIs and stateless APIs are well within vibe coding's comfort zone. The trouble starts when state has to be managed carefully across multiple services, sessions, or async operations. AI models are very good at generating code that handles the illustrated case. They're much weaker at reasoning about the combination of edge cases that only manifest under concurrent load or partial failure.
Race conditions, stale caches, and distributed transaction failures are rarely things you discover by reading the generated code. You discover them at 4pm on a Tuesday when your database has two conflicting records of the same order. By then, the AI has no useful context, and the original prompt is six months old.
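The duplicate-order scenario can be sketched deterministically. The example below is an illustration under stated assumptions, not a reproduction of any real incident: the orders table, the request_id column, and the function names are hypothetical, and the "concurrency" is simulated by running both existence checks before either insert, which is the interleaving that real concurrent requests can produce.

```python
import sqlite3


def make_db(unique: bool) -> sqlite3.Connection:
    """Create an in-memory orders table, optionally with a UNIQUE request_id."""
    conn = sqlite3.connect(":memory:")
    constraint = "UNIQUE" if unique else ""
    conn.execute(
        f"CREATE TABLE orders (request_id TEXT {constraint}, amount INTEGER)"
    )
    return conn


def order_exists(conn: sqlite3.Connection, request_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM orders WHERE request_id = ?", (request_id,)
    ).fetchone()
    return row is not None


def simulate_double_submit(conn: sqlite3.Connection, request_id: str, amount: int) -> int:
    """Replay the interleaving where two requests both check before either writes."""
    first_sees_order = order_exists(conn, request_id)
    second_sees_order = order_exists(conn, request_id)

    for sees_order in (first_sees_order, second_sees_order):
        if not sees_order:
            try:
                conn.execute(
                    "INSERT INTO orders VALUES (?, ?)", (request_id, amount)
                )
            except sqlite3.IntegrityError:
                # The database, not the application check, rejected the duplicate.
                pass

    count = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE request_id = ?", (request_id,)
    ).fetchone()[0]
    return count


print(simulate_double_submit(make_db(unique=False), "req-1", 100))
print(simulate_double_submit(make_db(unique=True), "req-1", 100))
```

The check-then-insert logic reads as correct in both cases. Only the version backed by a database constraint survives the interleaving, and that constraint is a decision someone has to make deliberately.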
This isn't a criticism of the model's intelligence. It's a structural limitation: the model doesn't know your specific deployment topology, your traffic patterns, or the fact that users routinely do things your spec never anticipated.
Maintenance costs compound silently
One of the most underappreciated costs of vibe-coded systems is what happens six months later when something needs to change. AI-generated code often lacks the kind of structural reasoning that makes future modification cheap. Abstractions get skipped because the model is optimizing for a working answer to the current prompt. Names are generic. Comments are auto-generated noise that describes what the code does rather than why it does it that way.
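The difference between a "what" comment and a "why" comment is easier to see side by side. In this sketch the two functions compute exactly the same value; the gateway timeout is a hypothetical constraint invented for the example, standing in for whatever real-world reason a number in your codebase has.

```python
def retry_delay_what(attempt: int) -> float:
    # Return the smaller of 0.5 times 2 to the power of attempt, and 30.
    return min(0.5 * 2 ** attempt, 30.0)


# Hypothetical constraint, invented for this example.
GATEWAY_IDLE_TIMEOUT_SECONDS = 30.0


def retry_delay_why(attempt: int) -> float:
    # Back off exponentially so a struggling upstream is not hammered, but
    # never wait longer than the gateway's idle timeout: past that point the
    # connection is dropped and the retry fails for an unrelated reason.
    return min(0.5 * 2 ** attempt, GATEWAY_IDLE_TIMEOUT_SECONDS)


for attempt in range(8):
    assert retry_delay_what(attempt) == retry_delay_why(attempt)
```

The first comment restates the expression. The second tells the next engineer what breaks if they change 30 to 60, which is the information that makes the change safe.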
When a new engineer joins the team — or when you yourself return to the codebase after three months — the absence of legible intent makes every change riskier. You don't know which parts of the system are load-bearing. Refactoring becomes excavation.
This is what people in AI engineering sometimes call AI technical debt: not the code you wrote badly, but the structural deficit accumulated by systems where no one was deliberately reasoning about long-term architecture while the code was being generated.
LLMs don't know your system
Every new conversation with an AI coding assistant starts cold. The model doesn't know that you renamed that table six weeks ago, that your rate limiter has a known bug under IPv6, or that the third-party API you're integrating returns inconsistent error codes in production versus staging.
Experienced engineers carry this kind of institutional knowledge implicitly. They don't write it down because it lives in their heads. When you outsource implementation to a model, that knowledge doesn't transfer. The generated code makes assumptions about a generic, well-behaved system that your actual production environment does not match.
The problem compounds when you're making changes to existing code rather than generating new code from scratch. The model's output is coherent with the code it can see in the context window. It may silently contradict constraints that exist outside the context window: in other files, in infrastructure config, in undocumented API contracts.
The Real Question: Who Is in the Loop?
Most of the failures above share a common root: not enough human judgment applied at the right points. That's not an argument against AI coding tools. It's an argument for being deliberate about where human oversight sits in the process.
Teams that use AI coding effectively tend to have at least one engineer who understands the system well enough to review generated code critically — not just "does this compile" but "does this behave correctly under the specific conditions our system creates." They're also usually explicit about which parts of the codebase are high-risk enough to require human authorship, not AI generation with human review.
The Thoughtworks analysis of production-grade vibe coding makes a similar point: the tool is not the variable. The engineering judgment around the tool is the variable.
This is why "I vibe coded my entire startup" is less informative than it sounds. It might mean you built something genuinely production-ready with disciplined use of AI tools and careful human oversight at every inflection point. Or it might mean you deployed something that works in demos and will accumulate invisible debt until it doesn't. The statement doesn't distinguish between these two outcomes.
A Framework for Deciding What to Vibe Code
Not prescriptive rules — just a way of thinking about the decision.
Surface area and reversibility. Code that is hard to audit and hard to change if something goes wrong is poor vibe coding territory. Security-critical paths, payment processing, auth flows, and anything that touches user data with regulatory implications deserve human authorship or, at minimum, extremely careful human review against a clear specification. Code that is easy to test, easy to replace, and isolated from critical paths is good vibe coding territory.
Operational duration. A prototype you'll use for two weeks is different from a service you'll run for two years. The longer something lives in production, the more the accumulated structural decisions matter, and the more those decisions need to reflect deliberate human reasoning, not the defaults of a stateless language model.
Team knowledge. Vibe coding can produce code faster than a team can build an understanding of it. If the generated code outpaces the team's ability to internalize what it's doing, you end up with a system that no one can confidently modify. The right pace is one where the team's comprehension keeps up with the output rate.
These aren't novel principles. They're the same engineering judgment calls that existed before AI coding tools. Vibe coding doesn't change what matters in production; it changes how fast you can generate things that look like they work.
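For teams that prefer to write the decision down, the three considerations above could be sketched as a small triage function. Everything here is an assumption made for illustration: the field names, the 26-week threshold for "long-lived", and the choice to send critical paths straight to human authorship rather than to the careful-review alternative. Treat it as a conversation starter, not a policy.

```python
from dataclasses import dataclass
from enum import Enum


class Authorship(Enum):
    VIBE_CODE = "vibe code"
    AI_WITH_REVIEW = "AI-generated with careful human review"
    HUMAN_AUTHORED = "human-authored"


@dataclass(frozen=True)
class Change:
    touches_critical_path: bool      # auth, payments, regulated user data
    easy_to_test_and_replace: bool   # surface area and reversibility
    expected_lifetime_weeks: int     # operational duration
    team_understands_area: bool      # team knowledge


def recommend(change: Change, long_lived_weeks: int = 26) -> Authorship:
    """Map a proposed change to a suggested level of human involvement."""
    if change.touches_critical_path:
        return Authorship.HUMAN_AUTHORED

    long_lived = change.expected_lifetime_weeks >= long_lived_weeks
    if (
        long_lived
        or not change.easy_to_test_and_replace
        or not change.team_understands_area
    ):
        return Authorship.AI_WITH_REVIEW

    return Authorship.VIBE_CODE


prototype = Change(False, True, 2, True)
payment_flow = Change(True, False, 104, True)
print(recommend(prototype).value)
print(recommend(payment_flow).value)
```

The value is less in the function than in being forced to answer its four questions before generating anything.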
What Changes When You're Building at Scale
The teams most likely to get burned by vibe coding in production are the ones that successfully used it at small scale and then didn't change their process as the system grew. At small scale, human review is easy because the codebase is small and one or two engineers understand all of it. As it grows, the cognitive load increases, the blast radius of a bad change grows, and the cost of the structural deficits compounds.
Production-grade agentic systems — the kind that Genta builds for engineering teams — have a different relationship with AI-generated code than a solo founder's side project. They need clear observability so failures are visible, explicit handling of the cases AI tends to skip, and architecture that separates high-risk components from fast-moving ones.
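As one illustration of "explicit handling of the cases AI tends to skip" combined with visible failures, here is a minimal sketch of an auth wrapper that fails closed and logs every rejection. It is not taken from any particular framework: the request is a plain dict, and require_auth, Unauthorized, fake_verify, and charge are hypothetical names invented for the example.

```python
import logging
from functools import wraps

logger = logging.getLogger("auth")


class Unauthorized(Exception):
    """Raised when a request cannot be tied to a verified user."""


def require_auth(verify_token):
    """Wrap a handler so it only runs for requests with a verified token."""

    def decorator(handler):
        @wraps(handler)
        def wrapper(request: dict):
            token = request.get("token")
            try:
                user = verify_token(token) if token else None
            except Exception:
                # A crashing verifier must not let the request through.
                logger.exception("token verification failed")
                user = None

            if user is None:
                # Fail closed, and leave a trace someone can alert on.
                logger.warning(
                    "rejected unauthenticated request to %s", handler.__name__
                )
                raise Unauthorized(handler.__name__)

            return handler(request, user)

        return wrapper

    return decorator


def fake_verify(token: str):
    # Stand-in verifier for the example; a real one would validate a signed token.
    return "alice" if token == "good" else None


@require_auth(fake_verify)
def charge(request: dict, user: str) -> str:
    return f"charged by {user}"


print(charge({"token": "good"}))
```

The missing-token, bad-token, and verifier-crash branches are exactly the ones a happy-path implementation tends to leave out, and they are the ones that matter on a payment endpoint.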
The Anthropic engineering cookbook and similar practitioner resources are useful here because they show what production-oriented AI-assisted development actually looks like: structured, reviewed, tested, and never fully delegated to the model.
The Honest Verdict
Vibe coding is a real productivity multiplier for a real class of problems. The hype is not entirely manufactured.
But production is a different environment than a demo. It has users who do unexpected things, load that exposes race conditions, regulators who care about data handling, and engineers who will need to modify the codebase in ways the original prompt never anticipated. Code generated without explicit consideration of those conditions will encounter all of them eventually.
The teams doing this well are not the ones who found a way to vibe code everything. They're the ones who got precise about what should be vibe coded, what should be reviewed, and what should be human-authored. That precision is itself an engineering skill — and right now, it's the one most people are skipping.
We’re Here to Help
Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.