AI-Specific Metrics
Measures the AI's actual contribution, not just overall activity or vanity metrics.
General outcome metrics like "number of users served" don't show what the AI itself is adding. Without metrics that isolate the AI's effect, teams risk confusing correlation with causation, attributing success to the AI when the real driver is staff, funding, or luck. This matters especially for equity: research has found AI pilots that looked successful in aggregate but widened disparities once outcomes were broken down by subgroup. Good metrics prove the AI is closing gaps, not just moving numbers around.
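A minimal sketch of why disaggregation matters, using invented numbers (the group names, counts, and accuracy figures below are illustrative assumptions, not data from any pilot):

```python
# Illustrative only: aggregate accuracy can look healthy while one subgroup
# is badly underserved. All counts below are invented for the example.
results = (
    [("group_a", True)] * 360 + [("group_a", False)] * 40     # 90% accurate
    + [("group_b", True)] * 40 + [("group_b", False)] * 60    # 40% accurate
)

def accuracy(rows):
    return sum(1 for _, correct in rows if correct) / len(rows)

print(f"overall accuracy: {accuracy(results):.0%}")            # 80% -- looks fine
for group in ("group_a", "group_b"):
    subset = [r for r in results if r[0] == group]
    print(f"{group}: {accuracy(subset):.0%}")                  # 90% vs. 40% -- hidden gap
```

The same pattern applies to error rates, time saved, or any other headline metric.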
What Good Looks Like
✓ Gap-closing focus with measurable targets tied to the identified problem
✓ Tracks stable-early metrics like time saved and error rates (more reliable than benchmark scores)
✓ Three-lens approach: at least one metric each for ROI (efficiency), safety (error rates), and equity (fairness)
✓ Results disaggregated by relevant subgroups to catch hidden disparities
✓ Status-quo baseline showing the AI measurably outperforms "the old way"
✓ Pre-registered metrics defined before pilot starts (not cherry-picked after)
✓ Named metrics owner with clear review cadence
✓ Defined rollback thresholds for when performance is too low to continue (see the pre-registration sketch after this list)
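As a concrete anchor for several of the items above, here is one hypothetical way to write down a pre-registered metrics plan before the pilot starts; every metric name, owner, cadence, and threshold is an illustrative placeholder, not a recommended value:

```python
# Hypothetical pre-registered metrics plan: one metric per lens, a named
# owner, a review cadence, and rollback thresholds fixed before launch.
# All names and numbers are placeholders to adapt, not recommendations.
METRICS_PLAN = {
    "registered_on": "2025-01-15",          # committed before the pilot begins
    "owner": "metrics-owner@example.org",   # named person accountable for review
    "review_cadence_days": 30,
    "metrics": {
        "efficiency": {"name": "staff_minutes_saved_per_case", "target": 10},
        "safety":     {"name": "error_rate", "baseline": 0.12, "rollback_if_above": 0.15},
        "equity":     {"name": "max_subgroup_accuracy_gap",    "rollback_if_above": 0.10},
    },
}
```

Writing the plan down as a fixed, dated artifact is what makes the "pre-registered" and "rollback threshold" items checkable later.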
What to Watch Out For
✗ Only tracking general outcomes like "users served" (doesn't isolate AI's effect)
✗ No plan to measure accuracy, error rates, or quality
✗ Not breaking down results by demographic subgroups (aggregate numbers hide disparities)
✗ No baseline comparison to show AI is better than the previous approach
✗ Relying solely on benchmark scores without real-world performance data
✗ Metrics chosen after seeing results (cherry-picking)
✗ No defined thresholds for when performance is too low
Tests To Apply
□ Do they track at least one metric from each category: efficiency (ROI), safety (error rates), and equity (fairness across groups)?
□ Are results disaggregated by relevant subgroups to catch hidden disparities?
□ Is there a status-quo baseline showing AI outperforms the previous approach?
□ Have they defined thresholds for when performance is too low to continue?
□ Is there a monitoring plan with named owners and review cadence?
□ Do they track stable-early metrics like time saved or error rates (not just user counts)?
□ Were metrics pre-registered before the pilot started?
□ Is there a causal design (A/B test, shadow deployment) to isolate the AI's effect? (A minimal sketch follows this list.)
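A minimal sketch of the causal-design test above, assuming a simple randomized comparison between an AI-assisted arm and a status-quo arm (the case counts are invented):

```python
# Illustrative two-proportion z-test: did the randomly assigned AI arm have a
# lower error rate than the status-quo arm? Counts below are made up.
from math import sqrt
from statistics import NormalDist

def two_proportion_test(errors_a, n_a, errors_b, n_b):
    """Return (error-rate difference, two-sided p-value) for two pilot arms."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a - p_b, p_value

# Status-quo arm: 60 errors in 400 cases; AI-assisted arm: 36 errors in 400 cases.
diff, p = two_proportion_test(36, 400, 60, 400)
print(f"error-rate difference: {diff:+.3f}, p = {p:.3f}")       # negative favors the AI arm
```

A shadow deployment or difference-in-differences comparison answers the same question when randomization isn't feasible.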
Key Questions to Ask
- What specific metrics prove the AI is adding value beyond what you could do without it?
- How will you know if the AI starts performing worse over time?
- Are you tracking outcomes separately for different demographic groups? What if one group has worse outcomes?
- What's your plan if the AI doesn't meet your success criteria?
- What's your baseline, and how does the AI compare to the old way of doing things?
Apply the Cross-Cutting Lenses
After evaluating the core criteria above, apply these two additional lenses to assess equity outcomes and evidence quality.
Equity & Safety Check
When evaluating AI-Specific Metrics through the equity and safety lens, assess whether success is measured fairly across groups and whether metrics would catch harm to subpopulations.
Gate Assessment:
🟢 CONTINUE: Subgroup parity maintained or improving, safety incidents rare and quickly resolved
🟡 ADJUST: Disaggregation exists, gaps identified, active mitigation with measurable improvement
🔴 STOP: No disaggregation, or major equity gaps detected with no mitigation plan
Check for:
□ Are all metrics disaggregated by relevant subgroups (race, language, disability, geography, device type)?
□ Do they avoid "average success" metrics that could hide disparities (e.g., 80% accuracy overall but 40% for one group)?
□ Are safety metrics weighted by severity of harm, not just frequency (one severe incident > ten minor ones)?
□ Is there a named owner responsible for reviewing equity gaps in metrics monthly (or more frequently)?
□ Are rollback triggers set for when equity gaps widen beyond acceptable thresholds? (See the sketch after this checklist.)
□ Do metrics include user trust/satisfaction by subgroup, not just technical performance?
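A hypothetical sketch of two checks from this list: severity-weighted safety scoring and an equity-gap rollback trigger. The weights and the 0.10 gap threshold are assumptions the team would set, not recommended values:

```python
# Illustrative only: weight incidents by severity so one severe event outweighs
# many minor ones, and trip a rollback when the subgroup gap exceeds a threshold.
SEVERITY_WEIGHTS = {"minor": 1, "moderate": 5, "severe": 25}   # assumed weights
EQUITY_GAP_ROLLBACK = 0.10                                     # assumed max tolerated gap

def weighted_safety_score(incident_severities):
    """Severity-weighted incident score, rather than a raw incident count."""
    return sum(SEVERITY_WEIGHTS[s] for s in incident_severities)

def equity_rollback_triggered(accuracy_by_group):
    """True when best- and worst-served subgroups differ by more than the threshold."""
    gap = max(accuracy_by_group.values()) - min(accuracy_by_group.values())
    return gap > EQUITY_GAP_ROLLBACK

print(weighted_safety_score(["minor", "minor", "severe"]))               # 27, not just "3 incidents"
print(equity_rollback_triggered({"group_a": 0.90, "group_b": 0.72}))     # True: 0.18 gap
```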
Evidence & Uncertainty Check
When evaluating AI-Specific Metrics through the evidence and uncertainty lens, assess whether metrics actually isolate the AI's contribution and whether uncertainty is quantified.
Quality Grade:
🅰️ A (Strong): Causal design executed, tight confidence intervals, pre-registered metrics, clear baseline comparison
🅱️ B (Moderate): Quasi-experimental design, moderate uncertainty, plan to tighten in next phase
🅲 C (Weak): Correlational only, no baseline, cherry-picked metrics; do not scale without better evidence
Check for:
□ Is there a causal design (A/B test, shadow deployment, or difference-in-differences) to isolate AI's effect?
□ Are metrics pre-registered before pilot starts (not cherry-picked after seeing results)?
□ Are confidence intervals reported on all key metrics (not just point estimates)? (See the sketch after this checklist.)
□ Is there a status-quo baseline showing AI outperforms the old way (with statistical significance)?
□ Do they track stable-early metrics (time saved, error rates) not just vanity metrics (users served)?
□ Are degradation triggers defined (if accuracy drops X%, we pause)?
□ Do they acknowledge what they DON'T know (e.g., "we can't yet measure long-term behavior change")?
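A minimal sketch of reporting a confidence interval rather than a bare point estimate, assuming the same kind of AI-vs-baseline error counts used above (all numbers are illustrative):

```python
# Illustrative 95% confidence interval on the AI-vs-baseline error-rate
# difference, using a normal approximation. Counts are invented placeholders.
from math import sqrt
from statistics import NormalDist

def error_rate_diff_ci(errors_ai, n_ai, errors_base, n_base, level=0.95):
    """Confidence interval for (AI error rate - baseline error rate)."""
    p_ai, p_base = errors_ai / n_ai, errors_base / n_base
    se = sqrt(p_ai * (1 - p_ai) / n_ai + p_base * (1 - p_base) / n_base)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_ai - p_base
    return diff - z * se, diff + z * se

low, high = error_rate_diff_ci(errors_ai=36, n_ai=400, errors_base=60, n_base=400)
print(f"95% CI for error-rate change: [{low:+.3f}, {high:+.3f}]")  # entirely below zero here
```

If the interval straddles zero, that uncertainty belongs in the "what we don't yet know" acknowledgment, not in a success claim.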
