The Model That Forgot What a Car Wash Is
AI models pattern-match before they reason. Your code pipeline has the same design.
Four AI models were given a simple question:
My car wash is a one-minute walk from my house. Should I: A) Drive, or B) Walk? Pick one.
Three picked B. One picked A.
Claude: “Walk. One minute is barely worth starting the engine.” ChatGPT: Walk, because it would “avoid the hassle of moving your car twice.” Gemini’s faster model went further, explaining that short drives don’t give your engine time to reach operating temperature and can cause moisture buildup in the oil. Confident answers. Detailed rationales.
A car wash is where you wash your car. The car needs to be there.
Every wrong answer came with a rationale that sounded reasonable. None of the rationales addressed why you go to a car wash. Only Gemini’s larger model caught it: “Unless you’re planning to carry your car, you’ll need to drive it there so it can actually get washed.”
When Claude was separately asked to reconsider from first principles, it corrected instantly. “Ha, good point. You’re going to the car wash to wash your car — which means the car needs to be there. Drive it.” One prompt. The reasoning was always available. It wasn’t the default.
The split is the finding. Same question. Same company’s models, in Gemini’s case. Different results. You cannot predict which model will reason and which will pattern-match on any given prompt.
Nobody will lose sleep over a wrong answer about a car wash. But the design that produced it is the same design running your code pipeline. AI-assisted development is no longer optional at most organizations. The model is a supplier in your software supply chain — one you evaluated on benchmarks and demos, not on whether it reasons about your specific security requirements.
The training pipeline for large language models collects human-generated text, learns the patterns, and reproduces them. The pattern for “short distance question” is “walking is better.” The pattern for SQL queries is whatever appeared most frequently on GitHub, secure or not. These patterns fire before the model considers the purpose of the task. The answer arrives before the reasoning starts.
Backslash Security tested seven popular LLMs on code generation in 2025. When given simple prompts, every model produced code vulnerable to at least four of the OWASP Top 10 weaknesses. When the prompts explicitly specified security requirements, five of seven models still produced vulnerable code. GPT-4o, given the instruction “make sure you are writing secure code,” produced secure output only 20% of the time. Veracode’s 2025 GenAI Code Security Report confirmed the pattern at scale: 45% of code samples introduced OWASP Top 10 vulnerabilities, and security performance has flatlined even as syntax has improved dramatically. Larger models don’t produce more secure code than smaller ones.
The instruction was clear. The training pattern was stronger.
The standard response to this data looks reasonable from inside. Teams write system prompts specifying security requirements. They reference OWASP guidelines. They include internal coding standards in the context window. Some fine-tune on internal codebases. The assumption is that the model now “knows” the standards and will follow them.
This assumption treats the model like a junior developer who read the documentation. It isn’t one. A junior developer who reads the OWASP Top 10 learns a principle and applies it to new situations. A model that has the OWASP Top 10 in its context window has seen the document. It has also been trained on millions of lines of code that violate it. The instruction and the training compete. The 20% secure output rate is the scoreboard.
The model doesn’t apply standards. It produces output that looks like output produced by someone who applied standards. Most deployment pipelines don’t distinguish between these.
Five changes move the failure rate from “unknown and untested” to “visible and managed.” None of them make the model reason. All of them make it harder for pattern-matched output to reach production unchallenged.
1. Define what correct looks like before you generate. The car wash models failed because nothing in the prompt specified the purpose of the trip. The question let them skip past the requirement and jump to the answer. Test-driven development applied to AI-generated code inverts this. Write the security test before you ask the model to write the implementation. Input validation test exists before the endpoint code is generated. Authentication test exists before the auth flow is written. The test encodes the reasoning the model won’t do by default. The model doesn’t need to reason about security if the test already embodies the security requirement. It just needs to pass. This is the strongest change on this list because it doesn’t depend on the model improving. It works with the models as they are today. If the car wash prompt had been “I need my car washed at the car wash one minute away,” every model would have gotten it right. Defining the requirement changes the output.
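As a minimal sketch of this inversion: the test below exists before any implementation does. The function name `build_search_query` and the schema are hypothetical; the point is that the assertions encode the security requirement (parameterized SQL, never string interpolation), so the model only has to pass them.

```python
# A security test written *before* asking the model for an implementation.
# `build_search_query` is hypothetical; the model must produce a version
# that passes these assertions.
MALICIOUS = "'; DROP TABLE users; --"

def test_query_is_parameterized():
    sql, params = build_search_query(MALICIOUS)
    # User input must travel as a bound parameter, never inside the SQL text.
    assert MALICIOUS not in sql
    assert any(MALICIOUS in p for p in params)

# A minimal implementation that satisfies the test, for illustration only:
def build_search_query(term: str):
    return "SELECT * FROM products WHERE name LIKE ?", (f"%{term}%",)
```

Whatever the model generates, string-concatenated SQL cannot pass this test, which is the whole point: the requirement lives in the test, not in the model's judgment.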
2. Separate generation from evaluation. The model that writes the code should not be the only model that reviews it. Use a second model, a different model, or a static analysis tool to evaluate the output against the specific standard you care about. The car wash failure happened because the model generated and self-evaluated in the same pass. The pattern that produced the wrong answer also produced the rationale for it. A separate evaluator breaks that loop. In code pipelines, this means running AI-generated code through a security scanner before it enters review, not after.
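In practice the evaluator would be a real scanner such as Bandit or Semgrep, or a second model. As a self-contained stand-in, a gate that refuses generated code containing known-dangerous patterns might look like this (the pattern list is a toy, not a real ruleset):

```python
import re

# A toy pattern list standing in for a full static-analysis tool.
DANGEROUS = [
    (re.compile(r"\beval\("), "use of eval"),
    (re.compile(r"f[\"'].*SELECT.*\{"), "f-string SQL construction"),
    (re.compile(r"verify\s*=\s*False"), "TLS verification disabled"),
]

def evaluate(generated_code: str) -> list[str]:
    """Return findings; an empty list means the code may proceed to review."""
    return [msg for pat, msg in DANGEROUS if pat.search(generated_code)]
```

The key design choice is that `evaluate` never sees the generator's rationale, only its output, so a confident-sounding explanation cannot talk the gate into passing bad code.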
3. Force the model to state assumptions before conclusions. The car wash models failed because they answered before considering the purpose of the trip. In code generation, the equivalent is producing an implementation before stating the security model. Structure your prompts to require the model to list its assumptions about the threat environment, the trust boundaries, and the input validation requirements before it writes a line of code. This doesn’t guarantee reasoning. It makes the absence of reasoning visible. When the model states “I assume all input is trusted” before writing code that doesn’t validate input, the failure is in the open instead of hidden inside a clean-looking implementation.
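One way to enforce this ordering is a prompt scaffold that demands the assumptions up front. The section names below are illustrative, not a standard:

```python
# A prompt scaffold that forces stated assumptions before implementation.
# The section headings are illustrative, not a standard.
SECURE_GEN_TEMPLATE = """\
Before writing any code, state:
1. THREAT MODEL: who can reach this code, and with what inputs?
2. TRUST BOUNDARIES: which inputs are untrusted?
3. VALIDATION PLAN: how is each untrusted input validated?

Only after completing all three sections, implement:
{task}
"""

def build_prompt(task: str) -> str:
    return SECURE_GEN_TEMPLATE.format(task=task)
```

A response that skips straight to code, or states “all input is trusted,” now fails visibly at review time instead of hiding inside a clean-looking implementation.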
4. Test for the car wash, not just the syntax. Most AI code evaluation checks whether the output compiles and passes functional tests. That’s exactly where the models excel — and exactly where they hide vulnerabilities. Your test suite needs adversarial cases that check whether the model reasoned about security, not just whether the code runs. Write tests that specifically target the patterns models get wrong: input validation, authentication logic, authorization checks, output encoding. If your test suite doesn’t include a “car wash question” for your domain — a simple case where the obvious pattern-matched answer is wrong — add one.
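A concrete example of a domain “car wash question”: open-redirect validation, where the pattern-matched answer (“starts with a slash, so it’s a relative path”) is wrong. The validator below is hypothetical; the adversarial cases are the point.

```python
# Adversarial cases targeting patterns models commonly get wrong.
# `sanitize_redirect` is a hypothetical generated function that should
# allow only same-site redirect targets.
CASES = [
    ("/dashboard", True),              # plain relative path: allowed
    ("https://evil.example/", False),  # absolute external URL: blocked
    ("//evil.example/", False),        # protocol-relative: the "car wash" case
    ("/\\evil.example", False),        # backslash confusion
]

def sanitize_redirect(target: str) -> bool:
    # Minimal reference behavior the generated code must match.
    return (target.startswith("/")
            and not target.startswith("//")
            and "\\" not in target)
```

A functional test that only checks `/dashboard` passes every naive implementation; the `//evil.example/` case is what distinguishes reasoning about the requirement from matching the pattern.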
5. Treat model consistency as a signal, not a guarantee. Gemini’s larger model got the car wash right. Gemini’s faster model got it wrong. Same company. Same question. If you’re selecting models for security-critical tasks, test each model on your specific failure cases, not on benchmarks. Run your own car wash tests: simple questions in your domain where the pattern-matched answer is wrong and the reasoned answer is right. Track which models pass. Re-test when models update, because a new training run can shift which patterns dominate.
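A tracking harness for this can be very small. The sketch below scores each model on a suite of domain-specific trap questions; the model callables are placeholders for real API clients, and the substring check is a crude stand-in for proper answer grading:

```python
from typing import Callable

def run_suite(models: dict[str, Callable[[str], str]],
              cases: list[tuple[str, str]]) -> dict[str, float]:
    """Return the fraction of trap cases each model answers correctly."""
    scores = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, expected in cases
                     if expected.lower() in ask(prompt).lower())
        scores[name] = passed / len(cases)
    return scores
```

Stored per model version, these scores become the regression signal the article describes: a new training run that shifts which patterns dominate shows up as a score drop on your own cases, not on a public benchmark.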
These changes manage the risk. They don’t eliminate it.
What they don’t solve: the fundamental problem that you cannot reliably distinguish model output that was reasoned from output that was pattern-matched without independently verifying the answer. Every verification step adds cost and time. At some point, the overhead of verifying AI output approaches the cost of writing the code yourself. The efficiency gain from AI-assisted development depends on trusting some outputs without full verification. Where you draw that line is a risk decision, not a technical one. For regulated industries, it’s also a compliance question: code that a human approved because it looked like it met the standard is not code that was reviewed against the standard. The gap between pattern-matched output and reasoned output is a gap your auditor will eventually find.
What remains genuinely unsolved: making models reason by default. The research community is working on it. None of the approaches are production-ready for security-critical applications today. Anyone claiming otherwise should be asked the car wash question.
The honest state of things: AI-assisted development is a bet that the model’s training patterns will align with the correct answer on each specific prompt. Sometimes they will. The process changes above make it visible when they don’t, before the code reaches production. That’s not a solution. It’s damage control. Right now, damage control is what’s available.
The model was trained on patterns. Your security depends on reasoning. Design your pipeline for the gap between them.
References
Backslash Security, “Can AI Vibe Coding Be Trusted?” (April 2025) — backslash.security
Veracode, “2025 GenAI Code Security Report” (July 2025) — veracode.com
Infosecurity Magazine, “Popular LLMs Found to Produce Vulnerable Code by Default” (April 2025) — infosecurity-magazine.com
Georgetown CSET, “Cybersecurity Risks of AI-Generated Code” (November 2024) — cset.georgetown.edu
Dark Reading, “LLMs’ AI-Generated Code Remains Wildly Insecure” (August 2025) — darkreading.com
Peters, D. & Ceci, S., “Peer-review practices of psychological journals: The fate of published articles, submitted again,” Behavioral and Brain Sciences 5(2), 187–195 (1982)