Blog | System Operations & Security | Nov 19, 2025

The Transformation of Testing in AI-Enabled Systems

The Six Million Dollar Bot

In April 2026, a story made the rounds that captures, more compactly than any analysis could, where the discipline of testing currently stands in the era of AI-enabled systems. A retailer's CEO had laid off the company's entire 12-person QA team a month earlier and replaced them with an AI-driven automated testing pipeline. The savings on paper were $1.2 million per year. The arithmetic looked clean. Then the AI shipped a regression that produced a 100% discount code applied across the entire store, and within hours the company had lost $6 million in orders. The aftermath — including, allegedly, the CEO asking one of the recently laid-off senior QA engineers to come back and remediate the incident without pay — turned what could have been a quiet case study into a viral cautionary tale that has been circulated extensively inside testing communities.

It is tempting to read this story as a parable about corporate overreach, or about the limits of AI, or about the timeless wisdom of not firing your QA team. All three readings are partially right. But the more useful reading — and the one that explains why this piece exists — is that the incident is a symptom of a deeper structural problem that almost every organization deploying AI-enabled software is now navigating, mostly without admitting it: the discipline of software testing as it has been practiced for the last forty years was built on an assumption that AI-enabled systems systematically violate, and the field is in the middle of a difficult, unfinished transition to figure out what testing should actually be when the assumption no longer holds.

Most coverage of "AI and testing" focuses on the tooling — AI-generated tests, AI-driven test case prioritization, intelligent flaky test detection. These are all real and worth knowing about. But they are downstream of a more fundamental shift, and the rest of this piece is about that shift, anchored throughout to the discount-code incident because the incident illustrates almost every dimension of what is actually changing.

The Assumption That Breaks

Conventional software testing rests on a single foundational assumption that has been so reliable for so long that most practitioners no longer notice they are making it: the same input, given to the same code, will produce the same output. This is determinism. It is the property that allows you to write expect(processOrder(input)).toEqual({status: 'confirmed', orderId: 'ORD-123'}) and trust that if the test passes today, it will pass tomorrow for the same reasons.

Determinism is what makes regression testing meaningful. It is what makes test suites grow stronger over time as you add cases. It is what allows a CI/CD pipeline to gate deployment on a green build. It is what allows you to bisect a bug to a specific commit. The entire epistemic foundation of software testing — the reason you can trust a test to tell you something true about a system — depends on the system being deterministic enough that "the test passed" is a stable claim about reality.

AI-enabled systems break this assumption in three distinct ways, and the mature response to each is different.

The first break is output non-determinism. The same prompt, given to the same model, can produce different outputs across calls. Temperature settings introduce explicit randomness; even at temperature zero, hardware-level non-determinism in floating-point operations can produce different tokens. A test that checks for an exact string match will fail intermittently for reasons that are not bugs.
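To make the exact-match problem concrete, here is a minimal sketch. The two response strings are invented for illustration (they are not real model output), and `carriesRequiredFacts` is a hypothetical helper, one of several reasonable ways to assert on content rather than wording.

```typescript
// Two hypothetical responses to the same prompt (illustrative strings, not real model output).
const responseA = "Your order ORD-123 is confirmed. Total: $42.00.";
const responseB = "Order ORD-123 has been confirmed; the total comes to $42.00.";

// Brittle check: any rephrasing counts as a failure.
function exactMatch(actual: string, expected: string): boolean {
  return actual === expected;
}

// Tolerant check: assert only the facts the response must carry.
function carriesRequiredFacts(response: string): boolean {
  const hasOrderId = response.indexOf("ORD-123") !== -1;
  const hasTotal = response.indexOf("$42.00") !== -1;
  const saysConfirmed = /confirmed/i.test(response);
  return hasOrderId && hasTotal && saysConfirmed;
}

console.log(exactMatch(responseB, responseA)); // false
console.log(carriesRequiredFacts(responseA)); // true
console.log(carriesRequiredFacts(responseB)); // true
```

Both responses say the same thing, so the exact-match failure is noise. The tolerant check still fails if the order ID or the total goes missing, which is the kind of failure worth being woken up for.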

The second break is upstream non-stationarity. The model behind your AI feature is not, in most production deployments, a fixed artifact. It is hosted by OpenAI, Anthropic, Google, or another provider that ships updates on its own cadence. Your code did not change. Your tests did not change. The model behind the API endpoint did. A test suite that passed last Tuesday may fail this Tuesday because of an upstream change you did not authorize and were not notified about.

The third break is contextual drift. Even if the model is fixed, the retrieval-augmented context flowing into the model keeps changing. The knowledge base updates, the user history accumulates, the prompt template gets a new instruction added by someone on a different team. The AI feature's behavior is the joint product of the model, the prompt, the retrieval, and the user history, and only one of those, the prompt, is typically under your version control in the way traditional software is.
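One way to make the second and third breaks visible is to record what the feature was actually running against each time the evals go green, then diff that record later. The sketch below assumes you can obtain a model identifier, a prompt template version, and a retrieval snapshot label; the `BehaviorInputs` shape and every value in it are hypothetical. User history is left out because it rarely reduces to a single version label.

```typescript
// Hypothetical record of the inputs that shape the feature's behavior.
interface BehaviorInputs {
  modelId: string;               // identifier reported by the provider (assumed available)
  promptTemplateVersion: string; // e.g. a tag or hash of the prompt template
  retrievalSnapshot: string;     // e.g. a version label for the knowledge base index
}

// Returns the names of the inputs that differ between two recorded runs.
function driftedInputs(lastGreen: BehaviorInputs, current: BehaviorInputs): string[] {
  const changed: string[] = [];
  if (lastGreen.modelId !== current.modelId) changed.push("modelId");
  if (lastGreen.promptTemplateVersion !== current.promptTemplateVersion) {
    changed.push("promptTemplateVersion");
  }
  if (lastGreen.retrievalSnapshot !== current.retrievalSnapshot) {
    changed.push("retrievalSnapshot");
  }
  return changed;
}

// Illustrative values only.
const lastTuesday: BehaviorInputs = {
  modelId: "model-2026-03",
  promptTemplateVersion: "v14",
  retrievalSnapshot: "kb-0412",
};
const thisTuesday: BehaviorInputs = {
  modelId: "model-2026-04",
  promptTemplateVersion: "v14",
  retrievalSnapshot: "kb-0419",
};

console.log(driftedInputs(lastTuesday, thisTuesday).join(", ")); // modelId, retrievalSnapshot
```

In this example nothing in the repository changed, yet two of the three recorded inputs did. A failing eval that comes with this diff attached is far easier to triage than one that just turns red.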

The CEO in the April 2026 story did not understand any of these three breaks well enough to know what could go wrong. The pipeline the company bought tested for the things the previous QA team had tested for: does the order go through, is the price calculated correctly, does the discount apply when the rules say it should. The AI feature that produced the catastrophic 100% discount probably passed all of those tests. The failure was not in any of the cases the test suite was designed to cover. It was in a case the test suite did not know how to look for, because the new failure modes that AI systems produce do not correspond to the input/output contracts that traditional tests describe.

What Regression Testing Becomes When It Decays

There is a property of LLM-system testing that is worth naming directly because it is structurally inverted from how QA has worked for forty years: regression testing for LLM systems decays over time instead of strengthening.

In traditional software, your regression suite gets stronger as it grows. Every bug becomes a test. Every edge case becomes a test. Every customer-reported failure becomes a test. The suite accumulates institutional knowledge. A test written four years ago about a bug fixed four years ago is still meaningful today, because the system is deterministic and the bug, once fixed, would only return through a specific kind of regression that the test would catch.

In LLM systems, the suite ages out. The prompt template evolves. The model gets updated. The retrieval corpus shifts. The product introduces new query shapes that the original test cases did not anticipate. A regression test that ran against last quarter's prompt template, evaluating last quarter's model behavior, against last quarter's customer data shapes, may still technically pass — and tell you nothing useful about whether the current system is working. It is a green test that is no longer testing what its name suggests.

This decay is one of the most uncomfortable properties of testing AI systems, and it is one of the least discussed in vendor literature. The implication is that the QA function for an AI-enabled product cannot be a one-time investment that compounds. It has to be a continuously refreshed practice where test cases are regularly retired, new ones are added based on production traffic, and the threshold for what counts as "passing" is itself revisited as the system's environment changes. The teams that understand this run their evals like a living dataset that gets curated week over week. The teams that don't understand it run their evals like a 1998 unit test suite, one that keeps getting larger and slowly stops correlating with whether anything works.
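As a rough sketch of what "curated week over week" could look like in code, the function below flags eval cases for human review when they were written against an older prompt template or no longer resemble recent production traffic. The `EvalCase` fields, the 30-day cutoff, and the sample data are all assumptions made for illustration, not a prescribed policy.

```typescript
interface EvalCase {
  id: string;
  promptTemplateVersion: string;     // template version the case was written against
  daysSinceSeenInProduction: number; // days since a similar query appeared in sampled traffic
}

interface CurationResult {
  keep: EvalCase[];
  review: EvalCase[]; // candidates to rewrite or retire, decided by a human
}

// Splits the dataset into cases that still match the current system and cases that may have aged out.
function curate(cases: EvalCase[], currentTemplate: string, maxStaleDays: number): CurationResult {
  const result: CurationResult = { keep: [], review: [] };
  for (const c of cases) {
    const stale =
      c.promptTemplateVersion !== currentTemplate ||
      c.daysSinceSeenInProduction > maxStaleDays;
    (stale ? result.review : result.keep).push(c);
  }
  return result;
}

// Illustrative dataset.
const dataset: EvalCase[] = [
  { id: "refund-basic", promptTemplateVersion: "v14", daysSinceSeenInProduction: 2 },
  { id: "promo-stacking", promptTemplateVersion: "v9", daysSinceSeenInProduction: 5 },
  { id: "legacy-coupon", promptTemplateVersion: "v14", daysSinceSeenInProduction: 120 },
];

const curated = curate(dataset, "v14", 30);
console.log(curated.keep.map((c) => c.id).join(", "));   // refund-basic
console.log(curated.review.map((c) => c.id).join(", ")); // promo-stacking, legacy-coupon
```

The function flags cases but does not delete them. Whether a stale case gets rewritten or retired is a judgment call, which is the point of the paragraph above.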

Note that the discount-code incident in April probably involved exactly this pattern. The AI testing pipeline almost certainly had assertions that had been written months earlier against an earlier version of the system. They probably still passed. The actual production behavior had drifted into territory the tests were not measuring, and the tests' continued green status was, in retrospect, a more dangerous signal than no tests at all — because it produced confidence that was not warranted.

The Replacement: Evaluations, and Why They're Different

What is replacing traditional testing for AI systems is something the field has agreed to call evaluation — eval, in the casual usage. The terminology is doing real work. It is not a rebrand of testing. It is a different epistemic practice, and the differences matter.

A test asks: did this specific input produce the expected output? The answer is yes or no. The test passes or fails. There is no middle.

An evaluation asks: across a representative sample of inputs, how well is this system performing on the dimensions we care about? The answer is a distribution, with a score, against a tolerance band that the team has decided in advance. An eval doesn't pass or fail in the binary sense. It produces a number that gets compared to a threshold. Below the threshold, you investigate. Above the threshold, you ship.
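A minimal sketch of that decision, assuming each input has already been graded with a score between 0 and 1. The 0.7 threshold and the score arrays are made up for the example, and treating a score exactly at the threshold as shippable is a choice each team has to make explicitly.

```typescript
type Decision = "ship" | "investigate";

// Average score across the sampled inputs.
function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to aggregate");
  let total = 0;
  for (const s of scores) total += s;
  return total / scores.length;
}

// Compares the aggregate score to a threshold agreed in advance.
function decide(scores: number[], threshold: number): Decision {
  return mean(scores) >= threshold ? "ship" : "investigate";
}

const strongRun = [1, 0.75, 0.75, 0.5];  // mean 0.75
const weakRun = [0.5, 0.5, 0.75, 0.25];  // mean 0.5

console.log(decide(strongRun, 0.7)); // ship
console.log(decide(weakRun, 0.7));   // investigate
```

Note that `strongRun` contains a 0.5. Under a binary test that single case would block the release; under an eval it is one data point inside a tolerance band.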

This shift from binary assertion to statistical scoring has a number of implications that haven't been fully absorbed by most engineering organizations. The release process can no longer be a green/red light gate; it has to be a threshold-based decision that allows for reasonable variation. The metrics being scored (semantic similarity, faithfulness to retrieved context, format adherence, safety compliance) are not always objective, and the choice of metric encodes assumptions that shape what the eval can and cannot detect. The same eval, run twice, may produce slightly different scores because the underlying system is non-deterministic, which means you have to run each case multiple times and aggregate the results, taking the majority outcome for pass/fail checks, rather than relying on a single execution. None of this matches the mental model that ten years of CI/CD has trained engineering teams to have about what testing means.
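The repeated-run idea can be sketched in a few lines. Here `runOnce` stands in for whatever executes and grades a single attempt; the stub that replays a fixed sequence of outcomes is purely illustrative, a deterministic stand-in for a non-deterministic system. Requiring an odd number of runs is one simple way to avoid ties.

```typescript
// Runs a pass/fail check several times and reports the majority outcome.
function majorityPass(runOnce: () => boolean, runs: number): boolean {
  if (runs <= 0 || runs % 2 === 0) {
    throw new Error("use an odd, positive number of runs to avoid ties");
  }
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (runOnce()) passes++;
  }
  return passes > runs / 2;
}

// Hypothetical stub: replays a fixed sequence of outcomes, one per call.
function replay(outcomes: boolean[]): () => boolean {
  let i = 0;
  return () => outcomes[i++ % outcomes.length] === true;
}

console.log(majorityPass(replay([true, false, true]), 3));  // true
console.log(majorityPass(replay([false, true, false]), 3)); // false
```

A single execution of the first stub could have landed on the failing call and blocked a release for no reason. Three runs cost three times as much, which is part of why evals are more expensive than unit tests.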

The most consequential implication is that evals require a quality of judgment that traditional tests don't. Writing a unit test mostly requires understanding the input and the expected output. Writing an eval requires understanding what good looks like for a non-deterministic system, which is a much harder problem. It requires curating a representative dataset of inputs that captures the distribution of real usage. It requires defining grading rubrics that capture the qualities you care about. It requires deciding which differences in output are meaningful and which are stylistic noise. This is not a task that the engineer who writes the feature can easily do by themselves; it requires either domain expertise about what users actually need or a structured process for capturing user judgment in a form the eval can use.

The mature pattern that has emerged in the last twelve months is what practitioners call the evaluation harness: a structured combination of test cases (tasks), grading logic (graders), full execution traces (transcripts), and verifiable results (outcomes), run repeatedly across multiple dimensions of system performance. The harness is to AI systems what unit tests were to deterministic software — the foundational tool for knowing whether the thing works — but it is meaningfully more complicated to build, more expensive to maintain, and more dependent on human judgment than unit tests were.
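The four parts named above map fairly naturally onto types. What follows is a deliberately small sketch, not any particular framework's API: the interface names, the `stubSystem`, and the two graders are all invented for illustration, and a real harness would add repetition, persistence, and much richer graders.

```typescript
interface Task { id: string; input: string; }
interface TranscriptStep { actor: "model" | "tool"; content: string; }
interface Outcome { output: string; transcript: TranscriptStep[]; }
interface Grader {
  name: string;
  grade: (task: Task, outcome: Outcome) => number; // score in [0, 1]
}
type SystemUnderTest = (task: Task) => Outcome;

// Runs each task once, keeps the full transcript, and returns each grader's mean score.
function runHarness(tasks: Task[], system: SystemUnderTest, graders: Grader[]): { [grader: string]: number } {
  const means: { [grader: string]: number } = {};
  if (tasks.length === 0) return means;
  const runs = tasks.map((task) => ({ task: task, outcome: system(task) }));
  for (const g of graders) {
    let total = 0;
    for (const r of runs) total += g.grade(r.task, r.outcome);
    means[g.name] = total / runs.length;
  }
  return means;
}

// Hypothetical stand-in for the AI feature: skips its lookup tool on task "t2".
const stubSystem: SystemUnderTest = (task) => {
  const transcript: TranscriptStep[] = [];
  if (task.id !== "t2") transcript.push({ actor: "tool", content: "lookup(" + task.input + ")" });
  transcript.push({ actor: "model", content: "answer for " + task.input });
  return { output: "answer for " + task.input, transcript: transcript };
};

const graders: Grader[] = [
  { name: "nonEmptyOutput", grade: (_t, o) => (o.output.length > 0 ? 1 : 0) },
  { name: "usedLookupTool", grade: (_t, o) => (o.transcript.some((s) => s.actor === "tool") ? 1 : 0) },
];

const report = runHarness(
  [{ id: "t1", input: "order status" }, { id: "t2", input: "refund policy" }],
  stubSystem,
  graders,
);
console.log(report.nonEmptyOutput); // 1
console.log(report.usedLookupTool); // 0.5
```

The output-only grader reports a perfect score while the transcript-aware grader reports 0.5. Keeping the full transcript next to the outcome is what makes the second grader possible at all.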

The Trajectory Problem That Final-Output Tests Miss

Within evaluation practice itself, there is a sub-problem worth flagging because it is producing a class of failures that most teams are not catching: agent systems fail at the trajectory level in ways that are invisible to final-output evaluation.

Consider an agent that is asked to book a flight. The agent makes ten tool calls, maintains context across the conversation, decides between options, handles a payment step, and returns confirmation. The final output is a booking confirmation that, on inspection, looks correct. The eval grades it as a pass.

What the eval did not capture: at step three, the agent made a tool call that misinterpreted the user's date preference, almost booked the wrong flight, and only corrected itself because of a follow-up clarification in step four. The agent's final output was correct, but the trajectory through the task was unstable, and a slightly different conversation would have produced the wrong booking. Research on LLM agent benchmarks has found that agents graded only on final-output quality pass 20–40% more test cases than they do under full trajectory evaluation. That gap is the class of failure that current evaluation practice is mostly not catching.

The implication is that for agent systems specifically, the eval has to grade the full execution trace, not just the final answer. This is operationally expensive — traces are long, grading is slow, the rubrics are complex — and most teams have not yet built the tooling to do it properly. The teams that have built it are catching agent failures their competitors are missing. The teams that haven't are shipping agents whose final outputs look right and whose intermediate behavior is unreliable in ways that will eventually produce a public failure.
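The flight-booking example can be reduced to a toy trace to show the difference between the two grading styles. The trace format, tool names, and dates below are hypothetical, and the trajectory rule used here (every tool call must carry the date the user asked for) is only one of many checks a real rubric would contain.

```typescript
interface ToolCall { step: number; tool: string; date: string; }
interface AgentRun { requestedDate: string; trace: ToolCall[]; bookedDate: string; }

// Final-output grading: only the end result is inspected.
function gradeFinalOutput(run: AgentRun): boolean {
  return run.bookedDate === run.requestedDate;
}

// Trajectory grading: returns the steps where a tool was called with a date the user did not ask for.
function offendingSteps(run: AgentRun): number[] {
  return run.trace
    .filter((call) => call.date !== run.requestedDate)
    .map((call) => call.step);
}

// Illustrative trace: the agent misreads the date at step 3 and recovers afterwards.
const bookingRun: AgentRun = {
  requestedDate: "2026-05-14",
  trace: [
    { step: 1, tool: "search_flights", date: "2026-05-14" },
    { step: 3, tool: "hold_seat", date: "2026-05-15" },
    { step: 5, tool: "hold_seat", date: "2026-05-14" },
    { step: 9, tool: "confirm_booking", date: "2026-05-14" },
  ],
  bookedDate: "2026-05-14",
};

console.log(gradeFinalOutput(bookingRun));          // true
console.log(offendingSteps(bookingRun).join(", ")); // 3
```

The final-output grader sees a correct booking and passes the run. The trajectory grader points at step 3, which is where a reviewer would start asking how close the agent came to booking the wrong day.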

The April discount-code incident is, almost certainly, an example of this trajectory problem. The AI testing pipeline was checking whether orders processed correctly. It was probably not checking the full execution path the AI feature took to apply discounts. The 100% discount was a trajectory failure that produced a final output the test pipeline scored as valid — the order processed, after all. The fact that the price was zero was, structurally, a different category of bug than the test pipeline was designed to detect.

The Human Role That Is Quietly Becoming More Valuable

The most counterintuitive thing happening in the AI testing space — and the one most directly contradicted by the April incident's headline framing — is that the value of skilled human testers is increasing, not decreasing, in well-run organizations.

The reasoning is straightforward once it is stated. AI handles routine test execution well. It handles regression checks at scale. It generates plausible test cases from specifications. It identifies patterns in failures. None of these are the bottleneck in testing AI systems. The bottleneck is deciding what counts as good, and that question has structurally moved closer to humans, not further from them.

Curating a golden dataset that represents the distribution of real user behavior requires human judgment about what users are actually doing. Defining grading rubrics for non-deterministic outputs requires human understanding of what qualities matter for the specific application. Conducting error analysis on cases where the LLM-as-judge disagrees with the human evaluator requires someone who can reason about why the disagreement is happening. Identifying behavioral boundaries — where the system should refuse, where it should escalate, where it should apply caution — requires people who understand the domain and the organization's risk posture. Applause's 2026 survey found that 60.8% of teams conduct evaluations with humans and 54.4% use human-generated prompt and response datasets for fine-tuning. The human-in-the-loop component is not vestigial. It is load-bearing.
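The error-analysis step on judge-versus-human disagreement usually starts with something as plain as the sketch below: compute how often the automated judge agrees with the human label and pull out the cases where it does not. The `GradedCase` shape and the sample labels are invented for the example.

```typescript
interface GradedCase { id: string; humanPass: boolean; judgePass: boolean; }
interface AgreementReport { agreementRate: number; disagreements: string[]; }

// Measures how often the LLM judge matches the human label and lists the cases where it does not.
function compareJudgeToHuman(cases: GradedCase[]): AgreementReport {
  if (cases.length === 0) throw new Error("no graded cases");
  const disagreements = cases
    .filter((c) => c.humanPass !== c.judgePass)
    .map((c) => c.id);
  return {
    agreementRate: (cases.length - disagreements.length) / cases.length,
    disagreements: disagreements,
  };
}

// Illustrative labels.
const graded: GradedCase[] = [
  { id: "case-1", humanPass: true, judgePass: true },
  { id: "case-2", humanPass: false, judgePass: true },
  { id: "case-3", humanPass: true, judgePass: true },
  { id: "case-4", humanPass: false, judgePass: false },
];

const agreement = compareJudgeToHuman(graded);
console.log(agreement.agreementRate);            // 0.75
console.log(agreement.disagreements.join(", ")); // case-2
```

The number is the easy part. Working out why the judge passed `case-2` when the human did not is the work that still needs a person who understands the domain.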

The QA roles that are being eliminated, when the elimination is being done thoughtfully rather than CEO-by-spreadsheet, are the manual execution roles — the ones that involved running the same test scripts repeatedly. The QA roles that are increasing in value are the ones that involve test strategy, risk assessment, dataset curation, judgment calibration, and the integration of QA insight into product decisions. Industry data has these higher-end roles commanding 20–40% salary premiums in 2026, with particularly steep premiums in regulated industries (financial services, healthcare, automotive) where the cost of a testing failure is large enough that companies are willing to pay for the judgment that prevents it.

The CEO's mistake in the April incident, in retrospect, was treating QA as a cost center where any reduction was good, without understanding that the reduction had to be in the right kind of QA work. Firing twelve QA engineers to save $1.2 million is fine if the work they were doing was the kind of work AI now does well. It is not fine if they were the people who, among other things, would have noticed that the AI testing pipeline didn't have a guard against pricing-rule failures. The $6 million loss is not really an indictment of AI testing tools. It is an indictment of an organization that didn't understand which parts of its QA function were AI-replaceable and which were the parts that prevented exactly this kind of incident from shipping.
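Nothing public describes how the retailer's pricing code was structured, so the following is only a guess at the general shape such a guard could take: a business-invariant check that runs on every order regardless of which component produced the discount. The `PricedOrder` type and the 50% cap are assumptions chosen for illustration.

```typescript
interface PricedOrder { orderId: string; subtotalCents: number; discountCents: number; }

const MAX_DISCOUNT_FRACTION = 0.5; // assumed policy cap, chosen for illustration

// Returns a list of violated pricing invariants; an empty list means the order looks sane.
function pricingViolations(order: PricedOrder): string[] {
  const violations: string[] = [];
  const totalCents = order.subtotalCents - order.discountCents;
  if (order.discountCents < 0) {
    violations.push("negative discount");
  }
  if (order.discountCents > order.subtotalCents * MAX_DISCOUNT_FRACTION) {
    violations.push("discount exceeds policy cap");
  }
  if (order.subtotalCents > 0 && totalCents <= 0) {
    violations.push("non-positive total on a non-empty order");
  }
  return violations;
}

const normalOrder: PricedOrder = { orderId: "ORD-1", subtotalCents: 10000, discountCents: 2000 };
const freeOrder: PricedOrder = { orderId: "ORD-2", subtotalCents: 10000, discountCents: 10000 };

console.log(pricingViolations(normalOrder).length); // 0
console.log(pricingViolations(freeOrder).length);   // 2
```

A check like this is ordinary deterministic code. Its value comes from someone knowing that "the order processed" and "the order was priced sanely" are different claims, and insisting that both be tested.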

What This Means for Organizations Building AI-Enabled Systems

Step back from the technical detail and the practical implications for any organization shipping AI features in 2026 are unusually concrete.

The traditional testing infrastructure your team has (unit tests, integration tests, end-to-end tests) is necessary but not sufficient for AI-enabled features. The AI portions of the system need an evaluation harness that runs continuously, grades on multiple dimensions, evaluates trajectories rather than just final outputs for agentic systems, and is calibrated by human judgment on representative datasets that get refreshed as the system's environment changes. This is not optional infrastructure. It is the new equivalent of having a CI pipeline. Organizations that don't have it are shipping AI features whose actual production behavior they cannot verify.

The release process for AI features has to be different from the release process for traditional features. Threshold-based gates, not binary pass/fail. Multiple eval runs with majority outcomes, not single executions. Production monitoring that continues evaluating the system's behavior after release, because the behavior can change without your code changing. The teams that have built this are catching regressions early. The teams that haven't are catching them when their customers do.
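Post-release monitoring can start very simply. The sketch below keeps a rolling window of graded production samples and raises a flag when the window's mean drops below a threshold. The class name, window size, threshold, and scores are all hypothetical; a production version would also need sampling, storage, and alert routing.

```typescript
class RollingEvalMonitor {
  private scores: number[] = [];

  constructor(private windowSize: number, private threshold: number) {}

  // Records one graded production sample; returns true when the full window's mean is below threshold.
  record(score: number): boolean {
    this.scores.push(score);
    if (this.scores.length > this.windowSize) this.scores.shift();
    if (this.scores.length < this.windowSize) return false; // not enough data yet
    let total = 0;
    for (const s of this.scores) total += s;
    return total / this.windowSize < this.threshold;
  }
}

// Illustrative stream: quality degrades after the fourth sample with no code change.
const monitor = new RollingEvalMonitor(4, 0.75);
const alerts = [1, 1, 1, 1, 0.5, 0.5, 0.5].map((score) => monitor.record(score));

console.log(alerts.indexOf(true)); // 6
```

The alert fires on the seventh sample, once three of the last four scores have dropped. A smaller window reacts faster but is noisier, a trade-off each team has to tune against its own traffic.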

The QA function has to be reconstituted, not eliminated. The role is shifting from executing tests to defining what good looks like, and the latter role is more skilled, more expensive, and more important than the former was. The organizations getting this right are upskilling their QA teams into evaluation engineers, dataset curators, and AI quality strategists. The organizations getting this wrong are firing their QA teams to save money and discovering, sometimes within weeks, that they have lost the institutional capacity to know whether their AI features actually work.

And — this is the part that should be uncomfortable for any leader making rapid AI-related staffing decisions — the organizations that are pulling ahead are the ones that take the discipline of testing AI systems seriously as its own thing, neither dismissing it as solved nor pretending the old testing toolkit is sufficient. The discipline is in transition. The tools are immature. The best practices are being figured out in real time. The leader who treats this transition as something the team will figure out organically while shipping features is the leader whose team will figure it out at the cost of a public incident, in the same general shape as the April 2026 retailer's $6 million afternoon.

Back to the Bot

The story of the discount-code bot and the laid-off QA team became a viral cautionary tale because it compresses, into a single afternoon, almost every dimension of what makes testing AI-enabled systems hard. Non-determinism that traditional tests don't catch. Trajectory failures that final-output evaluation misses. The decay of regression suites against an evolving system. The misclassification of QA as a routine cost rather than a judgment-intensive function. The CEO's confidence that AI tools can replace human QA, without understanding what human QA was actually doing.

The transformation of testing in AI-enabled systems is not, fundamentally, about new tools. The new tools exist and they are mostly good. It is about a discipline that has had to absorb, in the space of about three years, the structural fact that its foundational assumption — same input, same output — does not hold for the kind of systems being shipped. The replacement discipline is real but immature. The organizations that take it seriously will produce AI systems that work reliably. The organizations that don't will keep producing the discount-code bot, in different shapes, until enough of them have publicly failed that the lesson becomes industry-wide consensus.

The April 2026 incident is, in a sense, the field doing its learning in public. The next round of these incidents — and there will be a next round — will involve AI systems whose failures cost more than $6 million, in industries where the cost of failure is measured in human harm rather than refunded orders. The teams that have done the testing-discipline work properly will not be the ones in the news. The teams that haven't, will. The transition the discipline is in the middle of is uncomfortable, expensive, and unfinished. It is also non-optional. And the organizations treating it as if it were optional are, whether they realize it or not, scheduling a public lesson for themselves.