I’m building Etnamute — a system where AI agents write mobile apps from idea to App Store. User interview, market research, spec, design, code, marketing — all automated. My role came down to one thing: open the finished app and check if it works.
I used to joke that while my AI agents do the real work, I’m their errand boy — the tester. The joke stopped being funny around the third app, when I toggled the theme to dark and the screen stayed white. The agent swore everything worked. Fifty tests green. The app — white.
I decided it was time to build one more agent. A junior QA engineer that would take over testing. Here’s how we raised it — and what it taught us.
Suspect number one: unit tests
First thing Claude does when you ask it to test an app — it writes unit tests. Lots of them. Fifty in five minutes. Every state action verified. Every screen renders without crash. Every button press calls the right handler.
Looks solid. There’s one problem.
The test confirms that setTheme('dark') writes 'dark' to settings. Test passes. Screen stays white. The value got saved — the UI didn’t re-render. From the code’s point of view, everything is correct. From the user’s point of view — the dark theme toggle doesn’t work.
I asked Claude what’s going on. It said: “Visual theme application is a design gap.” Not a bug. A gap.
That’s when I saw the fundamental problem. Claude was testing code. It should have been testing promises.
The case of UI promises
Every button on screen is a promise to the user. A toggle labeled “Dark Theme” promises the screen will go dark. A currency selector “€” promises all prices will show euros. A “Save” button promises data will be saved.
A human tester gets this intuitively. They look at the screen, tap the toggle, look at the result. If the toggle says one thing and the screen shows another — that’s a bug. Not a “design gap.” A bug.
I turned this into what I call an interaction map. Before writing any test, Claude has to go through every screen and answer one question for each interactive element: what does the user expect when they look at this? Not what the handler code does — what the button label promises.
Then Claude reads the code and compares. User expectation vs actual behavior. If they don’t match — it’s a broken promise.
The difference is subtle but it matters. A “Notifications” switch saves a setting for a background service — no visible change on screen, and the user doesn’t expect one. That’s fine. But a “Dark Theme” toggle with no visible effect — that’s a broken promise. The label says one thing, the screen shows another.
The first run with this rule found something real. A subscription tracker app had a currency setting. User picks euro — the spending widget on home shows €15.49. The subscription card right next to it — still $15.49. Same screen, two components, one has a hardcoded dollar sign. The interaction map caught it because it asked: “currency change — which screens does it affect?” — and checked each one.
Could you write a unit test that catches this? Sure. But you’d have to predict that this specific component would forget to read the currency setting. You can’t write a test for every possible oversight. A human tester solves this differently: change the currency, look at the screen. If something didn’t update — you see it right away. No prediction needed.
You need hands. Well, fingers.
The interaction map is a plan. But the plan needs execution. Someone has to open the app and tap through it.
That’s what Maestro does — a UI testing framework for mobile apps. You describe a scenario in YAML: launch, tap here, type text, check what appeared. Maestro runs it on a real iOS simulator or Android emulator.
The key difference from unit tests: Maestro tests the built app. Same binary the user will install. If a library crashes on launch — Maestro sees it. If the keyboard covers the submit button — Maestro can’t tap it. If an animation leaves an element in the wrong spot — the check fails.
Here’s what it looks like:
- launchApp:
    clearState: true
- extendedWaitUntil:
    visible: "Add Subscription"
    timeout: 10000
- tapOn:
    id: "catalog-netflix"
- tapOn:
    id: "btn-submit"
- extendedWaitUntil:
    visible: "Netflix"
    timeout: 15000
- takeScreenshot: home_with_netflix
Every action waits for a specific result. Not sleep(3) — “wait until this text shows up.” If it doesn’t — the test fails and you get a screenshot of what the screen actually showed.
The scenarios come from the interaction map. Visual effects — verified by Maestro. Cross-screen effects (currency change in settings affects home and statistics) — navigation between screens plus screenshots. Effects with no visual feedback — regular unit tests. Each effect type is covered by the right method.
The whole thing is wrapped in one shell script — smoke.sh. It builds the app, boots a headless simulator (no GUI window), installs the binary, runs all scenarios, kills processes when done. One command. The agent is not allowed to run these steps by hand — if the script breaks, fix the script.
The third eye: visual check
Maestro confirms that taps worked and text appeared. But it doesn’t know how the app should look. The button might be there, but the color is wrong. Spacing is off. Layout doesn’t match the mockup.
Claude can look at images. After all scenarios pass, it opens every saved screenshot and compares against three sources of truth:
Stitch mockups. If the app was designed with Google Stitch, the original screen designs are saved next to the code. Claude compares the real screenshot to the intended layout — element placement, proportions, visual hierarchy.
Design tokens. A DESIGN.md file has exact colors, font sizes, spacing values. Claude checks if the screenshot matches.
The interaction map. If the scenario changed currency to euro, Claude checks that the screenshot actually shows euro signs everywhere, not dollars.
That’s exactly how the currency bug got caught. Maestro confirmed settings accepted the choice. The interaction map pointed to which screens should change. Claude looked at the screenshot — and saw that one component didn’t update.
Three layers, one QA engineer
Put it all together and you get something that looks a lot like how a real QA team works:
Planning. Read the spec. For each UI element, decide what the user should see. Flag anything suspicious before testing starts. That’s the interaction map.
Execution. Open the app. Go through every scenario. Check that buttons work. Data survives restarts. Bad input shows error messages. That’s Maestro.
Visual review. Compare each screen to the mockup. Check colors, typography, spacing. If you changed a setting — make sure the change reached every screen. That’s Claude looking at screenshots.
Each layer is a separate Claude Code skill — a markdown file that other skills can use. /build-app runs the full testing cycle after building. /improve-app uses just the interaction map and visual review — enough to check the impact of small changes. /test-app runs everything end to end.
How this is different from “AI writes tests”
In practice, this approach catches bugs in four categories that standard AI testing misses:
Broken promises. Theme toggle that saves the setting but doesn’t re-render. Currency selector that updates some components but not others. Search button with an empty handler.
Runtime crashes. Libraries that compile fine but crash on launch — because they need features not available in the user’s environment. The TypeScript compiler and the bundler see nothing. Maestro catches the crash on first launch.
Cross-screen mismatch. A setting that reaches one screen but not another. Data visible on one tab, missing on the next.
Design drift. Layout that shifted from the mockup. Colors that don’t match the design system. Inconsistent spacing between similar screens.
Nothing exotic. Normal problems every mobile developer deals with. But they’re invisible to unit tests — because unit tests check pieces in isolation. The interaction map checks promises. Maestro checks the assembled app. Visual review checks the visual contract.
Try it
The testing pipeline is part of Etnamute, the AI app factory I described earlier.
- interaction-map/SKILL.md — builds a test plan from UI analysis
- visual-review/SKILL.md — compares screenshots to mockups and the design system
- maestro/SKILL.md — scenario templates, Expo Router gotchas
- testing/SKILL.md — runs all three layers in order
- scripts/smoke.sh — build, simulator, Maestro, cleanup — one command
The approach is not tied to Expo or React Native. The interaction map works with any UI framework. Maestro supports iOS and Android. Visual review works with any screenshot. The idea — test promises, not code — works anywhere.
Honest note: this is not a silver bullet. The agent skips steps sometimes. Complex forms need extra work. And it all depends on how well your UI elements are addressable — if a button has no test ID, Maestro can’t tap it.
But between “AI runs tests and declares victory” and “AI opens the app and checks what the user sees” — that’s where most real bugs live.