Testing AI Software: Using Chao1 to Set Ship Criteria

The first beta build of Scriptless Test Automation survived thirty minutes in a customer’s hands, proving that testing AI software is a very different beast. One innocuous home-screen widget blew up the flow, and every patch we shipped uncovered a fresh cliff. Test suites were green, prospects were livid, the Board was impatient, and team morale cratered. We realized we weren’t testing software; we were mapping the behavior of a living system.

AI features live in uncharted country – no binary oracles, no neat coverage metrics, just an endless frontier of “mostly-works” edge cases. We needed a compass that could tell us how much unseen risk remained and answer the only question that mattered: “How close are we to done?”

Our compass emerged from an unlikely place: statistical species-richness estimation. By treating each new failure pattern as a “species,” Chao1 gave us a statistically defensible read on when we’d exposed enough of the unknown to ship with confidence.

What follows is how we wired Chao1 into our exploratory pipeline, the signals that matter, and why any team testing AI software needs a similar barometer for moving from testing to delivery.

Why “Pass/Fail” Breaks Down When Testing AI Software

Traditional software behaves like a vending machine: put in an input, expect the same candy every time. That makes binary testing natural – one scenario, one verdict, green or red. AI flips that model on its head. The algorithm is learning-based, probabilistic, and often stateful. Same input, different context, slightly different output. A single pass/fail flag can’t capture that range of behaviors.

We learned that lesson the hard way while testing AI software in Scriptless Test Automation. The first beta survived half an hour before a customer’s home-screen widget hijacked the flow and exposed a bug none of our green bars anticipated. Every hot-fix we rushed out surfaced a brand-new failure pattern. Sales saw an unending whack-a-mole; Engineering saw a matrix of permutations exploding faster than we could script; the Board saw slipping dates.

Three fault lines opened:

  • Non-determinism. Scriptless’s model re-evaluates UI context on each run. The exact same tap could resolve to a different element if layout or state shifted by a pixel. Yesterday’s “pass” became today’s “fail” with no code change.
  • State-space overload. Mobile apps differ wildly – deep links, widgets, dark mode, locale. Add variations in Android skins and iOS versions, and our traditional test matrix collapsed under its own weight.
  • Probabilistic quality. Customers needed the engine to choose the right element 99% of the time, not 100%. Yet pass/fail reporting couldn’t express “acceptable miss rate,” so we oscillated between over-testing and under-testing.

With every new edge case, the question grew louder: How close are we to done?

The result? False confidence when the board is watching and prospects are churning. We needed a yard-stick built for uncertainty. That’s why we turned to Chao1: instead of asking “Did every test pass?” it asks “How many undiscovered failure types still lurk out there?” – a statistical compass that finally let us make sane ship decisions in an AI world.

Meet Chao1 – An Ecologist’s Trick for Testing AI Software

Chao1 was born in ecology (Anne Chao, 1984) to estimate hidden species in a rainforest sample. Swap “species” for “failure patterns” and you have a practical way to ask:

“How many AI software bugs are still hiding beyond our test suite?”

Key data points:

  • Uniques (F1) – bugs seen once
  • Duplicates (F2) – bugs seen twice
  • Chao1 estimate – lower-bound of total bug “species”

One line of math turns a messy defect list into a single, converging metric the whole team can rally behind.
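
For reference, that one-liner is the standard Chao1 estimator, where Sobs is the number of distinct failure patterns observed so far:

Chao1 = Sobs + F1² / (2 · F2)                        (classic form, when F2 > 0)
Chao1 = Sobs + F1 · (F1 − 1) / (2 · (F2 + 1))        (bias-corrected form, used when F2 = 0)

The appendix code implements both branches.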

How We Turned Daily Chaos into Predictable Insight

Diagram 1: Chao1 Process for Testing AI Software

Building confidence in Scriptless Test Automation meant running it against the unknown every single day – then turning that chaos into data we could reason about. The loop we settled on is captured in Diagram 1, but the spirit matters more than the boxes and arrows: expose the engine to fresh terrain, watch where it stumbles, and convert each stumble into a signal.

1. Randomly choose a mobile app

We started by pulling one app at random from a pool that grew to more than a hundred titles across finance, retail, social, gaming, and a few wonderfully odd utilities. Randomization wasn’t a gimmick; it inoculated us against unconscious cherry-picking. If the algorithm was going to fail, we wanted to see it on the apps our customers might test tomorrow, not the tidy demos we’d rehearsed in the lab.

2. Execute exploratory tests

A tester spent on average 4 hours wandering the new app with human curiosity: change orientation, trigger push notifications, swipe through onboarding, add a widget, flip dark mode, kill and relaunch. Every action was captured in Kobiton’s Session Explorer and converted to a Test Case using Kobiton’s Mobile Test Management – click elements, element hierarchy, screenshots. Those trails of breadcrumbs became raw material for the next step.

3. Run Scriptless Test Automation from the session

With a single click, we turned the exploratory session into an automated replay: Scriptless Test Automation examined the recorded steps, inferred each selector, and attempted to replicate the flow on the same device. The beauty – and the terror – of this mechanic is that the AI makes decisions in real time. If the UI shifts even a hair, the model has to choose a new element. That’s where the interesting failures hide.

4. Analyze and classify every failure

When a run went off the rails, the tester captured the artifacts – screenshots, logs, selector fields – and asked one question: Have we seen this exact failure pattern before? If they were Anne Chao back in the 1980s, they would have been asking: have I seen this species of plant before?

  • If yes, the bug was mapped to its existing “species.”
  • If no, we opened a Jira ticket, documented the reproduction steps, and christened a brand-new species.

Those counts flowed straight into a living Google Sheet. Two columns mattered most: Uniques (bugs seen only once) and Duplicates (bugs seen exactly twice). Plug those into the simple Chao1 formula and you get a lower-bound estimate of how many distinct failure types still lurk undiscovered.
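
To make that bookkeeping concrete, here is a minimal sketch in the spirit of the appendix code; the species names and counts are invented for illustration:

// Occurrence tallies per failure "species", as recorded in the sheet (invented data).
const occurrences: Record<string, number> = {
  'home-widget-hijacks-flow': 1,   // seen once  -> counts toward F1 (uniques)
  'dark-mode-selector-miss': 2,    // seen twice -> counts toward F2 (duplicates)
  'deep-link-wrong-screen': 4,     // seen often -> contributes to Sobs only
}

const counts = Object.values(occurrences)
const Sobs = counts.length                       // distinct species observed so far
const F1 = counts.filter(c => c === 1).length    // uniques
const F2 = counts.filter(c => c === 2).length    // duplicates

// Chao1 lower-bound estimate: classic form when F2 > 0, bias-corrected form otherwise.
const chao1 = F2 > 0
  ? Sobs + (F1 * F1) / (2 * F2)
  : Sobs + (F1 * (F1 - 1)) / (2 * (F2 + 1))
// Here: 3 + 1 / 2 = 3.5 – the estimate sits above the 3 species observed, so more are likely still out there.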

Then we looped back to Step 1 with the next random app. One tester could complete one to two full cycles in a day; a small squad kept the engine humming five days a week for six straight months.

Watching the numbers tell a story

Diagram 2: The team’s results graph for Scriptless Test Automation

The time-series chart in Diagram 2 became our morning stand-up ritual. A quick decoding:

  • Blue bars (App Count): how many unique apps we’d exercised to date.
  • Green bars (Test Steps): the sheer volume of interactions replayed by Scriptless.
  • Dark green line (Found): cumulative distinct bugs actually discovered.
  • Purple line (Estimated Defects): Chao1’s prediction of total bug species.
  • Light green line (Low at 99% Confidence Interval): the lower bound of the 99% confidence envelope around that estimate.
  • Red line (High at 99% Confidence Interval): the upper bound of the 99% confidence envelope around that estimate.

Early on, the red line looked like a seismograph – swinging wildly as each new species moved the goalposts. But around week three a pattern emerged: blue and green kept climbing, the light-green line rose more slowly, and the purple line began to flatten. Most importantly, the confidence band between the light-green and red lines narrowed from ±40% to under ±10%. That convergence did two things:

  1. Restored morale. Engineers could see, statistically, that the mountain of “unknown unknowns” was shrinking.
  2. Enabled clear trade-offs. When the high-confidence upper bound dropped below a threshold we were willing to carry into production, we knew it was time to ship. We will dive into this next.

Knowing When to Ship: Statistical Stop-Rules

The Chao1 dashboard did more than plot curves – it put a finish line in sight. Each morning the team watched the high and low ends of the 99% confidence band inch toward one another, and morale flipped from “Will this ever end?” to “We’re closing the gap.” One engineer nailed the vibe: “It feels like we’re mining Bitcoin, but for bugs.” Every new species we uncovered pushed the estimate down and the certainty up.

Two weeks in, the convergence became unmistakable. That’s when our product manager asked the question every team secretly fears: “So…when do we stop?”

In traditional releases, the answer is binary: no open SEV-1 or SEV-2 defects, green light. But when you’re testing AI software, the world is probabilistic. We weren’t chasing absolute correctness; we were chasing a tolerable likelihood that customers wouldn’t unearth surprises we hadn’t already catalogued.

Defining “good enough” in statistical terms

After a short but animated Zoom debate, we set a rule:

Ship when the cumulative bugs discovered equals the Chao1 point estimate and the 99% confidence interval has effectively collapsed (high ≈ low).

In plain English: once the model predicted no more undiscovered species – and the math said the margin of error was negligible – we were done. No more whack-a-mole, just diminishing returns.
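
Expressed in code, the rule is only a few lines. The sketch below leans on the Chao1Result shape from the appendix; readyToShip and its tolerance are hypothetical names, and the thresholds are ours to pick, not part of Chao1 itself:

import { Chao1Result } from './chao1'   // wherever the appendix snippet lives (illustrative path)

// Ship when the bugs actually found have caught up with the Chao1 point estimate
// and the confidence interval has effectively collapsed (high ≈ low).
export function readyToShip(result: Chao1Result, tolerance = 1): boolean {
  const estimateReached = result.chao1Number - result.Sobs < tolerance
  const intervalCollapsed = result.upperBound - result.lowerBound < tolerance
  return estimateReached && intervalCollapsed
}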

On September 10, 2024, the Android-native line crossed that threshold. The chart looked beautiful: purple estimate flat, the red and light-green confidence lines hugging it, and the dark-green “Found” count touching the plateau. The team rang a virtual gong in Slack and queued the celebratory gifs.

Diagram 3: Slack message to communicate status to the wider Kobiton team.

  • Found: at this point we had unearthed 16 of the 21 bug species Chao1 predicted.
  • F1: 7 and F2: 5: the uniques (bugs seen once) and duplicates (bugs seen exactly twice) that power the Chao1 formula.
  • Revisit actions: by this point we had processed 2,144 test steps in Scriptless Test Automation.
  • Confidence band: “Confident interval 99% of (9, 33)” means the model expects the true total number of defect species to fall between 9 and 33 with 99-percent certainty.
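
Those figures also let you check the estimate by hand with the classic formula: Chao1 = 16 + 7² / (2 × 5) = 16 + 4.9 = 20.9, which rounds to the 21 predicted species in the message.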

Trust, but verify

Celebrations are fun; regressions in production, less so. I asked the crew to run ten additional random apps before we cut the build. Think of it as a statistical audit: could we add fresh evidence without moving the estimate? App #7 (the Enlighten app) broke the streak – one new species slipped through. That single data point nudged the Chao1 curve just enough to matter.

So we tightened the stop-rule:

  • Ship only after ten consecutive random apps yield zero new species.

It added three extra days and about 6,000 additional test steps, but the second run held firm. Zero new bugs, confidence interval unchanged, estimate steady. This time the gong felt earned.
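
A minimal sketch of that audit check, assuming you log how many new species each audit app surfaces (the helper name is ours, not part of any tool):

// newSpeciesPerApp: new failure species surfaced by each audit app, in run order.
export function hasCleanStreak(newSpeciesPerApp: number[], required = 10): boolean {
  if (newSpeciesPerApp.length < required) return false
  return newSpeciesPerApp.slice(-required).every(count => count === 0)
}

// First audit pass: app #7 surfaced one new species, so the streak broke.
hasCleanStreak([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])   // false
// Second pass: ten clean apps in a row – clear to ship.
hasCleanStreak([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])   // true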

What you can borrow when testing AI software

  • Make the rule explicit before the curve flattens. Otherwise optimism will tempt you to declare victory early.
  • Use a secondary validation loop. Ten clean runs was our choice; pick a number that balances risk with schedule for your context.
  • Celebrate data, not dates. The calendar said we were late; the dashboard said we were ready. The latter won – and production proved it right.

Statistical stop-rules won’t give you perfection when testing AI software, but they will give you a defensible moment to move from exploration to delivery. In the messy, probabilistic world of AI features, that’s as close to certainty as product teams are likely to get – and sometimes, that’s exactly what you need to ship with confidence.

Field Notes & Starter Kit When Testing AI Software

The six-month run left us with a feature users love and a repeatable playbook for anyone testing AI software. Here’s what we’d keep – and what we’d tweak – if we started tomorrow.


What Mattered Most

  • Visible progress fuels teams. Why it matters: a shrinking confidence band gave every engineer an immediate sense of momentum; one dev compared it to “mining Bitcoin, but for bugs.” How to apply: publish the Chao1 dashboard where everyone can see it, and let the numbers drive the stand-up, not the other way around.
  • Freeze the code until exit criteria are met. Why it matters: shipping hot-fixes mid-study skews the species counts and invalidates the estimate. How to apply: hold all fixes behind a feature branch; merge only after you’ve cleared the stop-rule.
  • Define “bug” once – then live by it. Why it matters: inconsistent triage creates phantom species and inflates the estimate. How to apply: draft a one-page guideline that spells out what is and isn’t a new failure pattern, and revisit it only if the team hits an edge case you truly missed.
  • Beware the “Success” monoculture. Why it matters: if 95% of events resolve to a single “Success” species (think a pine-tree forest), Chao1’s value drops. How to apply: use Chao1 when failure types are sparse but meaningful; if your success rate hovers near perfection, switch to a severity-weighted metric.
  • Choose the right estimator. Why it matters: Chao1 is a non-parametric richness model (Anne Chao, 1984) – great for sparse, skewed data, but not the only game in town. How to apply: explore alternatives such as Chao2 (incidence-based), ACE (Abundance-based Coverage Estimator), Jackknife 1/2, and Good-Turing, and pick the model that best fits your sampling style and data distribution.
  • Create diversity in the input. Why it matters: broader input variety surfaces failure patterns your core-domain tests miss, giving Chao1 a truer read on hidden risk. How to apply: blend daily random-app sampling with a curated set of domain-critical scenarios and track each stream in its own Chao1 panel.

Your Starter Kit

  1. Grab the code. Drop the TypeScript snippet (you’ll see it in the appendix) into a utility module. The only dependency is jstat for the z-score.
  2. Bootstrap a spreadsheet. Columns for Date, App Name, Species ID, Occurrences. A simple pivot gives you F1/F2 counts; the script does the rest (see the mapping sketch right after this list).
  3. Set a stop-rule up front. We used “Chao1 plateau + 10 clean apps.” Adjust the confidence level or app count to match your risk tolerance.
  4. Automate the chart. Pipe the script’s output into a Google Sheet so everyone sees the same truth.
  5. Run one dry cycle. Before you stake the release on it, practice on a throw-away branch. You’ll surface gaps in taxonomy and tooling long before the numbers matter.
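
If your sheet follows that layout, mapping raw rows into the Specie[] shape the appendix function expects is a short pivot. The row type and field names below are assumptions that mirror the columns in item 2:

import { calculateChao1, Specie } from './chao1'   // the appendix module (illustrative path)

// One sheet row per observation.
interface SheetRow { date: string; appName: string; speciesId: string; occurrences: number }

// Pivot the rows into one Specie entry per distinct failure pattern.
function toSpecies(rows: SheetRow[]): Specie[] {
  const tally = new Map<string, number>()
  for (const row of rows) {
    tally.set(row.speciesId, (tally.get(row.speciesId) ?? 0) + row.occurrences)
  }
  return Array.from(tally.entries()).map(([specieName, occurrences]) => ({ specieName, occurrences }))
}

// const result = calculateChao1({ species: toSpecies(exportedRows), confident: 0.99 })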

Statistical stop-rules won’t promise perfection, but they will give your team an unambiguous, data-driven moment to ship. In a world where AI behavior shifts under our feet, that kind of clarity is hard currency – worth more than Bitcoin, if you ask the team that mined those bugs. And, if you want to read more about AI in Software Testing, check out our AI in Software Testing: A Comprehensive Guide.

Appendix – Reference Implementations for Testing AI Software

TypeScript + jstat implementation

This snippet drops straight into any Node or browser-based tooling pipeline. Feed it a list of species (each with a name and occurrence count) and a desired confidence level; it returns Chao1’s point estimate plus upper- and lower-bounds, F1/F2 counts, and the raw species-observed metric. Because it relies only on the lightweight jstat library for z-scores, you can invoke it in a Jest test, a build-time script, or even a CI/CD step that fails the pipeline when the upper-bound exceeds your release threshold. Copy, paste, npm i jstat, and you’re in business.

import * as jst from 'jstat'

let jstat: any = jst   // untyped alias so we can call jstat's distribution helpers directly

export interface Specie {
  specieName: string
  occurrences: number
}

export interface Chao1Data {
  species: Specie[]
  confident: number
}

export interface Chao1Result {
  Sobs: number
  chao1Number: number
  lowerBound: number
  upperBound: number
  confidentInterval: number
  F1: number
  F2: number
}

export function calculateChao1(data: Chao1Data): Chao1Result {
  const { species, confident } = data;

  // Observed richness: number of distinct failure "species" seen so far.
  const Sobs = species.length;

  // F1 = species seen exactly once (uniques), F2 = species seen exactly twice (duplicates).
  const F1 = species.filter(d => d.occurrences === 1).length;
  const F2 = species.filter(d => d.occurrences === 2).length;

  // Point estimate: bias-corrected form when F2 is zero, classic Chao1 otherwise.
  let S_chao1: number;
  if (F2 === 0) {
    S_chao1 = Sobs + (F1 * (F1 - 1)) / (2 * (F2 + 1));
  } else {
    S_chao1 = Sobs + (F1 * F1) / (2 * F2);
  }

  // Variance of the estimate, used to build a symmetric normal-approximation confidence interval.
  let variance: number;
  if (F2 === 0) {
    variance = Math.pow((F1 * (F1 - 1)) / (2 * (F2 + 1)), 2);
  } else {
    variance = (F1 * (F1 - 1)) / (2 * (F2 + 1)) + Math.pow(F1, 2) * Math.pow(F1 - 1, 2) / (4 * Math.pow(F2, 2));
  }

  const standardError = Math.sqrt(variance);

  // Two-sided z-score for the requested confidence level (≈ 2.576 at 99%).
  const zScore = jstat.normal.inv(1 - (1 - confident) / 2, 0, 1);

  const ciLower = S_chao1 - zScore * standardError;
  const ciUpper = S_chao1 + zScore * standardError;

  return {
    Sobs,
    chao1Number: S_chao1,
    lowerBound: ciLower,
    upperBound: ciUpper,
    confidentInterval: confident,
    F1: F1,
    F2: F2
  };
}

// Variant estimate and confidence bounds patterned after the mothur toolkit's Chao1 output.
export function calculateMothur(data: Chao1Data): Chao1Result {
  const { species, confident } = data;
  const Sobs = species.length;
  const F1 = species.filter(d => d.occurrences === 1).length;
  const F2 = species.filter(d => d.occurrences === 2).length;

  let varChao1 = (F1 * (F1 - 1)) / (2 * (F2 + 1));
  if (F1 > 0 && F2 > 0) {
    varChao1 = (F1 * (F1 - 1)) / (2 * (F2 + 1)) +
      Math.pow(F1, 2) * Math.pow(2 * F1 - 1, 2) / (4 * Math.pow(F2 + 1, 2)) +
      Math.pow(F1, 4) * F2 * (F1 - 1) / (4 * Math.pow(F2 + 1, 3));
  } else if (F1 > 0 && F2 === 0) {
    varChao1 = (F1 * (F1 - 1)) / 2 +
      Math.pow(F1, 2) * Math.pow(2 * F1 - 1, 2) / 4 -
      Math.pow(F1, 4) / (4 * varChao1);
  } else if (F1 === 0 && F2 >= 0) {
    varChao1 = Sobs * Math.exp(-Sobs / Sobs) * (1 - Math.exp(-Sobs / Sobs));
  }

  const z = 1 + confident;
  const lowerBound = varChao1 + (F1 * (F1 - 1)) / (2 * (F2 + 1)) * Math.exp(-z * Math.sqrt(Math.log(1 + varChao1 / Math.pow(varChao1 - Sobs, 2))));
  const upperBound = varChao1 + (F1 * (F1 - 1)) / (2 * (F2 + 1)) * Math.exp(z * Math.sqrt(Math.log(1 + varChao1 / Math.pow(varChao1 - Sobs, 2))));

  return {
    Sobs,
    chao1Number: varChao1,
    confidentInterval: confident,
    lowerBound,
    upperBound,
    F1,
    F2
  };
}
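
A quick usage sketch in the spirit of the CI/CD gate mentioned above; the data, import path, and threshold are placeholders:

import { calculateChao1, Specie } from './chao1'   // wherever you dropped the snippet

// Illustrative counts pulled from the defect sheet.
const species: Specie[] = [
  { specieName: 'home-widget-hijacks-flow', occurrences: 1 },
  { specieName: 'dark-mode-selector-miss', occurrences: 2 },
  { specieName: 'deep-link-wrong-screen', occurrences: 4 },
]

const result = calculateChao1({ species, confident: 0.99 })
console.log(`Found ${result.Sobs}, estimated ${result.chao1Number.toFixed(1)} species`)
console.log(`99% CI: (${result.lowerBound.toFixed(1)}, ${result.upperBound.toFixed(1)})`)

// Example release gate: fail the pipeline while the upper bound exceeds the risk you accept.
const RELEASE_THRESHOLD = 25   // pick your own tolerable species count
if (result.upperBound > RELEASE_THRESHOLD) {
  process.exit(1)
}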

Google Sheets macro

If your team lives in spreadsheets, the Apps Script macro brings Chao1 to them with zero engineering lift. Select a range containing your species IDs (one per row), call =chao1WithCI(A2:A200, 0.99) in a cell, and Sheets returns both the estimate and the 99% confidence interval. Because everything stays inside Google Workspace, PMs and execs can tweak data, see numbers update instantly, and even chart the trend without leaving the sheet – perfect for lightweight experiments or teams without TypeScript in their stack.

function chao1WithCI(range, confidenceLevel) {
  // Get the values from the range
  const values = range.flat();
  
  // Count the frequency of each value
  const freqMap = {};
  values.forEach(value => {
    if (value in freqMap) {
      freqMap[value]++;
    } else {
      freqMap[value] = 1;
    }
  });

  // Calculate f1 and f2
  let f1 = 0;
  let f2 = 0;
  let observedSpecies = 0;
  
  for (let key in freqMap) {
    if (freqMap[key] === 1) {
      f1++;
    } else if (freqMap[key] === 2) {
      f2++;
    }
    observedSpecies++;
  }

  // Calculate Chao1
  let chao1;
  if (f2 === 0) {
    chao1 = observedSpecies + (f1 * (f1 - 1)) / 2;
  } else {
    chao1 = observedSpecies + (f1 ** 2) / (2 * f2);
  }

  // Calculate the variance for the confidence interval
  let variance;
  if (f2 === 0) {
    variance = 0;  // Cannot compute variance properly without f2
  } else {
    variance = f2 * ((f1/f2)**4 + (f1/f2)**3 + (f1/f2)**2) / 4;
  }

  // Calculate the z-score for the given confidence level
  const z = Math.abs(NormalDistInv((1 + confidenceLevel) / 2));

  // Calculate the confidence interval
  const ciLower = chao1 - z * Math.sqrt(variance);
  const ciUpper = chao1 + z * Math.sqrt(variance);

  return `Chao1: ${chao1}, ${Math.round(confidenceLevel * 100)}% Confidence Interval: (${ciLower}, ${ciUpper})`;
}

// Function to calculate the inverse of the normal distribution
function NormalDistInv(p) {
  if (p < 0.5) {
    return -RationalApproximation( Math.sqrt(-2.0 * Math.log(p)) );
  } else {
    return RationalApproximation( Math.sqrt(-2.0 * Math.log(1.0 - p)) );
  }
}

function RationalApproximation(t) {
  const c0 = 2.515517;
  const c1 = 0.802853;
  const c2 = 0.010328;
  const d1 = 1.432788;
  const d2 = 0.189269;
  const d3 = 0.001308;
  return t - ((c2 * t + c1) * t + c0) / (((d3 * t + d2) * t + d1) * t + 1.0);
}
