Why Agentic Quality Engineering Breaks Without Real Devices
Frank Moyer
There is a ritual in mobile teams that has survived every new framework, every new vendor category, and every bold promise about faster releases.
A build is ready. Someone picks up a phone.
Maybe it is the developer, leaning back from the laptop and tapping through the flow by hand. Maybe it is a tester sitting nearby. Maybe it is a QA team on the other side of a handoff. Different org chart, same motion. Software changed, and now a human being has to physically validate it on a device.
That model is starting to break.
Not because real-device testing no longer matters. It does.
Not because humans no longer matter. They do too.
It is breaking because the starting point is moving. The quality loop no longer has to begin with a person holding a phone. For a large portion of test cases, it can begin in the developer’s AI workspace (Claude Code, OpenAI Codex, Copilot), where the code changed in the first place, where the requirements live, and where feedback can influence the next decision within minutes.
That is what makes Agentic Quality Engineering different.
This is not a shinier version of mobile test automation. It is not just better script generation. It is a change in operating model. It turns validation from a human-led execution process into a developer-initiated, machine-led loop that can decide what matters, select the right environments, run specialist agents, and return useful product feedback without waiting for someone to manually work through the same regression path again.
That framing fits a broader industry shift. Deloitte’s 2026 software outlook argues that software creation is getting faster and cheaper in the agentic AI era, and that companies integrating agentic capabilities across the software development life cycle, including testing, may unlock more value.
The future of mobile quality is not a faster test script creator. It is a faster feedback loop.
Agentic Quality Engineering needs a real definition before the market sands it down into another AI slogan.
Agentic Quality Engineering is a developer-initiated, machine-led software execution and validation model where AI interprets requirements, decides what to validate based on change and risk, executes across environments, invokes specialist agents, and returns actionable product feedback directly into the development workflow. For mobile, that means the quality loop can start in the developer’s AI workspace instead of with a person picking up a phone or waiting on someone else to do it.
That last sentence is the heart of this post.
For years, mobile quality has been trapped behind the same bottleneck. Somebody had to be the person with the device. Somebody had to open the app, navigate the flow, compare the result, and communicate what happened.
Sometimes that works just fine. Sometimes it even feels faster than setting up anything more formal. That is exactly why the habit survives.
But it never scales well. And once developers start working inside AI-native environments that already understand code changes, task intent, and product context, the old ritual starts to look more like latency and less like discipline.
This is what AI mobile testing looks like when it matures beyond script generation.
The important shift is not that AI can generate tests. Plenty of people will say that, and most of them will say it like they invented gravity. We already have tools, like Kobiton’s Appium Script Generator, that help translate manual intent into executable tests. That is not the breakthrough. The breakthrough is what happens when that capability becomes part of a larger, machine-led validation loop.
The real shift is that the point of initiation has moved.
Developers are increasingly working inside environments that do more than help write code. They hold the context of the change. They can see the requirement, the diff, the history, the likely blast radius, and the next logical action. Once that happens, quality no longer has to wait for a downstream ticket that says “ready for testing.”
It can start in the same place the software changed.
That is a bigger shift than it sounds. When the developer can initiate validation directly from the AI workspace, the quality loop gets pulled upstream. The handoff starts to disappear. And the old assumption that a human has to pick up a device before anything meaningful can happen stops looking like process and starts looking like waste.
Mobile is the first place this model cracks because mobile still carries one ugly tax that web teams increasingly avoid.
Physical interruption.
The industry calls it manual mobile testing. But that label understates the problem. The problem is not just that humans are doing the work. It is that the work is physically disconnected from where software is now being created.
A web developer can often validate quickly in the same working context. A mobile developer often cannot. The old loop still asks for one of two things. Either the developer picks up the phone sitting next to the laptop and taps through the change by hand, or the developer pings someone else to do it.
That creates delay. It creates context switching. It creates a hidden queue nobody bothers to name because everyone has normalized it.
This is why manual device-in-hand mobile testing is such an important target. It is not just another outdated workflow. It is the one most obviously misaligned with where software work is heading.
Manual mobile testing is usually discussed like a labor problem.
It is bigger than that. It is a workflow problem.
When a developer ships a change and the next step requires a physical interruption, the human becomes the transport layer between code change and validation. The code is ready. The requirement exists. The build is available. But progress is blocked until a person does the carrying, literally or organizationally.
That is why manual mobile testing is really workflow latency.
It hides inside habits that look normal. Grabbing the nearest iPhone. Asking a tester to run the latest build. Waiting for someone to confirm that the login still works, the notification still fires, the camera flow still behaves, the permission popup still resolves cleanly.
Each individual action feels small. Together they create a quality model that slows down precisely where development is speeding up.
The first work to get absorbed is not deep exploratory thinking. It is repetition.
Repetitive execution. Rerunning known regressions. Mobile regression testing has always been the tax nobody budgets for but everybody pays. The bulk of mobile app manual testing is verifying the same flow after every change because somebody has to.
That is exactly the sort of work Agentic Quality Engineering is built to take off human hands.
Not because the work is unimportant. Because it is important and predictable. Because it happens over and over. Because it consumes attention that should be spent on exceptions, edge cases, ambiguous product behavior, and the places where real human judgment still matters.
When people hear “the end of manual testing,” they picture a machine replacing every person in the room. That is the wrong image.
The real image is simpler. Humans stop spending their day tapping on a physical phone proving for the hundredth time that the same basic flow still works after another incremental change.
People have been talking about mobile test automation for a long time. Most of that conversation was really about taking a human flow and encoding it into a script, then maintaining the script forever like a bonsai tree with anger issues.
Useful? Sure.
Enough? No.
The reason this shift is happening now is not that teams suddenly got religion about automation. It is that AI workspaces have moved quality initiation upstream, while teams need richer risk signals than device access alone can provide. Device-in-hand workflows mainly solve one problem: access to a real device and an environment. Agentic Quality Engineering is trying to solve a bigger one: how to turn change, context, and risk into meaningful feedback directly inside the development loop.
Deloitte’s 2026 software outlook points in the same direction: teams applying agentic AI across the full software development life cycle, from requirements through deployment, monitoring, and testing, may unlock more value.
That matters because the old model was built for a world where software changed more slowly, handoffs were tolerated, and “someone checked it on a phone” passed for a quality strategy.
That world is ending.
Traditional automation helped. It just never solved the whole problem.
It still relied on writing and maintaining scripts. It still required specialized skills. It still broke in annoying ways. It still lived in a separate tooling universe from the place where the developer made the change. And it often treated execution as the finish line instead of the middle of a feedback loop.
That is why so many teams never crossed over fully. Not because they did not want quality. Because the cost of getting there stayed too high. And the cost of remaining there was understated.
Agentic Quality Engineering changes the system around the workflow, not just the mechanism of execution. It makes the question less about “Can we automate this step?” and more about “Can the system decide what matters, run it in the right place, and tell us something useful quickly enough to shape the next move?”
That is a much bigger leap.
This is the step change.
For a large portion of mobile test cases, the developer no longer needs to pick up the phone or wait for a manual tester to do it. The system can initiate the real-device workflow directly from the AI workspace and return useful findings in the same place the change happened.
That changes the economics of quality immediately.
The bottleneck stops being, “Who is going to test this?” and becomes, “What deserves escalation, what is high risk, and what feedback should change the release decision?”
That is a much better bottleneck to have.
This is not just a story for elite automation teams with polished DevOps decks and a slide called “quality transformation.”
A large part of the market, including major enterprises, still relies heavily on device-in-hand testing. The exact numbers are messy, but one widely shared estimate pegs the global QA workforce at roughly 6 to 9 million professionals, based on developer-to-tester benchmarks and public workforce assumptions.
You do not need fake precision to see the point.
Even if only a fraction of that workforce is deeply involved in mobile, and only a fraction of that mobile segment still operates through device-in-hand workflows, you are still talking about a huge volume of repetitive human effort spent validating the same flows over and over.
That is the opportunity.
Not “replace all testers.”
Replace the assumption that meaningful mobile validation must begin with a human holding a phone.
Device-in-hand testing survives for the same reason bad habits usually survive. It is easy. It is familiar. It is available right now.
No procurement cycle. No orchestration logic. No workflow redesign. No retraining. Just grab the phone and go.
That makes it a stubborn competitor, especially in organizations where quality is under pressure but not yet rethought. A mobile testing platform with real devices was always the better answer. But if the experience felt slower or less natural than reaching for the phone on the desk, many teams never gave it full support or made it the default. That is what changes when the developer can initiate validation directly.
Once the system becomes easier to start than the habit, the habit starts to die.
When I built Wingman, an AI-powered business card and badge scanner, our testing infrastructure was not exactly a tribute to modern systems design.
It was a Galaxy S8+ and an iPhone 8 being passed around by the developers.
That was not because we were unserious. It was because that is how a lot of teams actually operate when speed, budget, and practicality collide. You test with what you have. You validate the key paths. You keep moving.
That is the point. Device-in-hand behavior is not a startup problem or an enterprise problem. It shows up everywhere. The scale changes. The habit often does not.
Once you stop treating quality as a handoff and start treating it as part of the development loop, the workflow changes fast.
The system sees the code diff. It interprets the relevant requirement or acceptance criteria. It decides what needs to be validated based on change and risk. It selects the right environment mix. It executes. It invokes specialist agents where needed. Then it returns findings directly into the developer's workflow.
That is the new loop.
Not because the machine is magical. Because it no longer waits for a human courier.
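The loop described above can be sketched as a simple orchestration routine. This is a hypothetical illustration, not any vendor’s actual API; every name here (ChangeContext, select_environment, select_agents, run_validation) is invented for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch of the agentic validation loop described above.
# All names and fields are illustrative assumptions, not a real product API.

@dataclass
class ChangeContext:
    diff_paths: list    # files touched by the change
    requirement: str    # linked acceptance criteria
    risk: str           # "low" | "medium" | "high"

def select_environment(ctx: ChangeContext) -> str:
    """Emulator where speed matters, real device where fidelity matters."""
    return "real-device" if ctx.risk == "high" else "emulator"

def select_agents(ctx: ChangeContext) -> list:
    """Pick specialist agents based on what the diff actually touched."""
    agents = ["functional"]
    if any(p.endswith((".xml", ".storyboard")) for p in ctx.diff_paths):
        agents.append("visual")          # layout files changed
    if ctx.risk != "low":
        agents.append("performance")
    return agents

def run_validation(ctx: ChangeContext) -> dict:
    """Return product feedback, not just pass/fail, into the dev workflow."""
    return {
        "environment": select_environment(ctx),
        "agents": select_agents(ctx),
        "escalate_to_human": ctx.risk == "high",
    }

ctx = ChangeContext(
    diff_paths=["app/src/login/LoginActivity.kt", "res/layout/login.xml"],
    requirement="User can log in with biometric fallback",
    risk="high",
)
print(run_validation(ctx))
```

The point of the sketch is the shape, not the specifics: change context goes in, environment and agent choices are made by the system, and a decision-ready result comes back without a human routing each step.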

In the old model, the path was linear and human-gated.
Code changed. A build was created. Somebody notified QA. Somebody ran the flow. Somebody reported back. If something looked wrong, the loop restarted with more human routing in the middle.
In the new model, the system handles the routing.
A code change lands. The AI workspace already has the context. It can map the change to likely risk. It can identify which flows matter. It can choose an emulator where speed is important and a real device where fidelity matters. It can invoke visual, accessibility, performance, or failure-analysis agents in parallel. It can gather evidence and return something more useful than a pile of logs.
It can return a point of view.
This is what continuous testing actually looks like when it is not just a CI pipeline label.
The loop closes without a human manually carrying it from step to step. That does not mean humans disappear. It means they enter later, where they add more value.
The person is no longer responsible for keeping the process moving. The system is.
That is a meaningful upgrade.
This is the nuance people mangle when they are in a hurry to sound futuristic.
Real-device testing stays.
Real devices remain the final authority when confidence matters. Mobile apps still live in the mess of actual hardware, operating system versions, sensors, permissions, biometrics, notifications, network variability, and device-specific behavior that never quite behaves the same way twice just because someone wrote a clean requirement.
Agentic Quality Engineering does not remove that reality.
What it removes is the assumption that a human must physically hold the device to trigger and execute the default workflow.
Real devices stay. Device-in-hand does not.
None of this is an argument against virtual devices, or a claim that every check must start on physical hardware. Emulators have their place in the mix. The default loop just no longer has to start in a human hand.
If you are a device-in-hand tester, this shift matters directly to your role.
But it is not an extinction story. It is a role-change story.
The opportunity now is to move beyond repetitive execution and into higher-value work: remediation, judgment, governance, and requirement refinement. AI-driven QA does not remove the need for human expertise. It raises the value of expertise applied in the right places.
The strongest testers in this next phase will be the ones who stop defining their value as execution speed and start defining it as domain knowledge and judgment.
They will know where the system is likely to fail. They will know which requirements are too vague to trust. They will know when a result is high confidence and when it needs human escalation. They will know where the product risk lives, where the policy boundaries are, and where the machine needs feedback to get better.
That is a better job.
It is also a more durable one.
In Agentic Quality Engineering, humans move from execution to remediation, judgment, governance, and requirement refinement. That is not a downgrade. It is where the skills and leverage lie.
For device-in-hand testers, the path forward is to move from manual execution into reusable automation and, over time, into developer-initiated validation workflows.
That transition does not need to happen all at once. The first step is turning repeated human knowledge into automation assets the system can reuse; Kobiton’s Appium Script Generator fits here. The next step is connecting those assets directly to the developer’s AI workspace, so validation starts where the code changes happen.
This is the category shift underneath the workflow shift.
The old model produced status artifacts.
Passed. Failed. Blocked. Needs retest. Step 14 passed. Step 22 did not.
Useful, sometimes. Sufficient, rarely.
The output of Agentic Quality Engineering should not be a longer pass/fail report. It should be software issues, requirement gaps, risk signals, and release confidence.
That changes the category.
Because teams do not actually care about test cases for their own sake. They care about whether the product is safe to ship. They care whether the requirement survived contact with execution. They care whether something regressed, whether the user experience degraded, whether the evidence supports confidence or caution.
The output is product feedback, not test results.
That sounds small. It is not. It is the difference between measuring the process and informing the decision.
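The difference is easy to see side by side. Here is a hypothetical contrast between the two kinds of output; the field names are illustrative, not any tool’s real schema.

```python
# Old model: a status artifact about the process.
status_artifact = {
    "test_case": "TC-1042 login regression",
    "steps_passed": 21,
    "steps_failed": 1,
    "status": "FAILED",
}

# New model: product feedback that informs the release decision.
# (Every field here is an invented example, not a real report format.)
product_feedback = {
    "issue": "Biometric fallback dialog never appears after permission denial",
    "requirement_gap": "AC does not specify behavior when biometrics are disabled",
    "risk": "high",  # blocks a core login path
    "evidence": ["video", "view tree", "log excerpt"],
    "release_confidence": 0.45,
    "recommendation": "escalate to human review before release",
}

# One measures the process; the other carries a point of view on shipping.
def informs_decision(report: dict) -> bool:
    return "risk" in report and "recommendation" in report

print(informs_decision(status_artifact))   # False
print(informs_decision(product_feedback))  # True
```

Step 14 passed and step 22 did not is a fact about the script. Risk, evidence, and a recommendation are facts about the product.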
If you lead engineering, product, or quality, the wrong response is to ask how to preserve the old manual workflow with some AI sprayed over the top.
The right response is to redesign the quality loop around where software work now begins.
Start with requirements and acceptance criteria. If they are vague, you are feeding ambiguity into the system and pretending it is automation.
Define orchestration rules. Decide what should run on virtual devices and what must touch a real device before anyone trusts the result.
Instrument the evidence layer. That only works if the platform can capture rich device evidence, including screenshots, view trees, logs, metrics, crash data, and video, so the system is making decisions from real context instead of partial signals.
Set escalation boundaries. Decide where the system can proceed on confidence and where it must stop for human judgment.
Change the metrics. Stop obsessing over pass rates as the main event. Measure feedback speed, exception resolution, requirement quality, escaped defects, and release confidence instead.
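Those four moves can be written down as explicit policy rather than tribal knowledge. The sketch below is one hypothetical way to express orchestration rules, escalation boundaries, and loop metrics as plain data a team could adapt; none of the thresholds or category names are prescriptive.

```python
# Hypothetical quality-loop policy. All values are illustrative assumptions.
POLICY = {
    "orchestration": {
        # Fast checks can run on virtual devices.
        "emulator": ["layout smoke checks", "navigation regressions"],
        # Hardware-dependent flows must touch a real device before trust.
        "real_device": ["biometrics", "camera", "push notifications",
                        "permission prompts"],
    },
    "escalation": {
        # The system may proceed on its own at or above this confidence.
        "auto_proceed_min_confidence": 0.85,
        # These areas always stop for human judgment, regardless of confidence.
        "always_escalate": ["payments", "auth", "data deletion"],
    },
    "metrics": [
        "feedback latency (change -> finding)",
        "exception resolution time",
        "requirement ambiguity rate",
        "escaped defects",
        "release confidence",
    ],
}

def needs_human(area: str, confidence: float) -> bool:
    """Escalation boundary: policy areas always escalate; low confidence escalates."""
    esc = POLICY["escalation"]
    return area in esc["always_escalate"] or confidence < esc["auto_proceed_min_confidence"]

print(needs_human("payments", 0.99))  # True: policy boundary, not confidence
print(needs_human("settings", 0.90))  # False: high confidence, unrestricted area
```

Writing the boundaries down this way is what makes the earlier advice operational: the system knows where it can proceed on confidence and where it must stop, and the metrics list is what replaces pass rates as the main event.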
That is how you prepare for this shift. By making quality faster, smarter, and more connected to the place where software is actually created.
Manual device-in-hand mobile testing will not disappear overnight. Manual mobile testing still has a role. But it is no longer the starting point.
That is the real shift.
Agentic Quality Engineering does not remove humans from mobile quality. It does not remove the need for real devices. It removes human device-in-hand execution as the center of gravity.
The developer starts the loop from the AI workspace. The system decides what matters. Real devices remain essential. Humans remain essential. But the human no longer has to be the one physically executing the default path every time software changes.
That changes what kind of work humans do, where they step in, and what is actually worth paying human judgment for.
The future of mobile quality is not a faster test generator. It is a faster feedback loop.
In Agentic Quality Engineering, humans stop picking up devices and tapping on screens and start governing the exceptions.