Business Term

Evals

AI Evaluation

AI evaluation set

AI evaluation tests whether AI outputs or actions meet quality, safety, cost, and business criteria. It is a prerequisite for production generative AI and agent workflows.

Formula

Passing cases / eval cases

Use when

Teams can compare model, prompt, retrieval, and fine-tuning options with evidence.

Watch out

Accuracy, grounding, format, safety, bias, tool actions, review load

Updated: 07/04/2026 | Quality: Reviewed | Page tier: Reviewed article | Sources: 2

What it means

AI evaluation is the practice of repeatedly measuring whether an AI system's answers, decisions, tool calls, and long-task behavior meet expected criteria. It applies not only to classification with known labels, but also to summarization, grounded answers, drafting, code generation, and agent execution. A production evaluation defines test cases, scoring criteria, passing thresholds, human review ownership, and update cadence. Because models, prompts, retrieval data, and tools change, evaluation is an ongoing operating process for quality rather than a one-time acceptance check.

How to calculate it

AI evaluation uses task-specific scores together with failure rates.

Lens	Formula	What it shows
Pass rate	Passing cases / eval cases	Shows minimum launch quality
Critical failure rate	High-impact errors / eval cases	Captures incident risk
Regression rate	Previously passing cases now failing / prior passing cases	Shows change impact
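The three ratios in the table can be computed from two eval runs. The following is a minimal sketch, not a reference implementation: the `CaseResult` record, the `summarize` function, and the idea of flagging a case as a critical failure with a boolean are illustrative assumptions, and each team would define "high-impact error" in its own rubric.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One scored eval case (hypothetical record shape)."""
    case_id: str
    passed: bool
    critical_failure: bool = False  # high-impact error, as defined by the team's rubric


def summarize(current: list[CaseResult], previous: list[CaseResult]) -> dict[str, float]:
    """Compute pass rate, critical failure rate, and regression rate."""
    total = len(current)
    if total == 0:
        raise ValueError("eval set is empty")

    # Pass rate: passing cases / eval cases.
    pass_rate = sum(r.passed for r in current) / total

    # Critical failure rate: high-impact errors / eval cases.
    critical_failure_rate = sum(r.critical_failure for r in current) / total

    # Regression rate: previously passing cases now failing / prior passing cases.
    prior_passing = {r.case_id for r in previous if r.passed}
    now_failing = {r.case_id for r in current if not r.passed}
    regression_rate = (
        len(prior_passing & now_failing) / len(prior_passing) if prior_passing else 0.0
    )

    return {
        "pass_rate": pass_rate,
        "critical_failure_rate": critical_failure_rate,
        "regression_rate": regression_rate,
    }
```

Matching cases by `case_id` means the regression rate only counts cases present in both runs, so adding new cases does not distort it.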

What counts / what does not

AI evaluation covers the model, prompts, data, tools, UI, and operating rules.

Item	Treatment	Why it matters
Include	Accuracy, grounding, format, safety, bias, tool actions, review load	Practical quality
Exclude	One-off impressions, demo feel, model-name comparisons	Not reproducible
Make explicit	Cases, rubric, pass line, reviewer, update cadence	Required for improvement
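One way to keep the "Make explicit" items from staying implicit is to record them in a single structure and check it before an eval run. The sketch below is an assumption about how a team might do this; `EvalSpec`, its field names, and `missing_items` are hypothetical and not part of any standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalSpec:
    """The five items the table says to make explicit (hypothetical shape)."""
    cases: list[str] = field(default_factory=list)        # eval case identifiers
    rubric: dict[str, str] = field(default_factory=dict)  # criterion -> description
    pass_line: Optional[float] = None                     # minimum acceptable pass rate
    reviewer: str = ""                                    # owner of human review
    update_cadence_days: Optional[int] = None             # how often the set is refreshed


def missing_items(spec: EvalSpec) -> list[str]:
    """Return the names of items that are still undefined."""
    missing = []
    if not spec.cases:
        missing.append("cases")
    if not spec.rubric:
        missing.append("rubric")
    if spec.pass_line is None:
        missing.append("pass line")
    if not spec.reviewer.strip():
        missing.append("reviewer")
    if spec.update_cadence_days is None:
        missing.append("update cadence")
    return missing
```

An empty result from `missing_items` does not mean the eval set is good, only that nothing required for improvement has been left unstated.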

What moves the number

Evaluation quality depends on representative cases, failure cases, scoring criteria, and regression testing.

Driver	Metric impact
Case design	Include real user questions and known failures
Rubric	Make pass/fail explainable
Failure examples	Boundary and risk cases prevent incidents
Regression	Verify changes do not break known-good behavior
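The "Rubric" driver, making pass/fail explainable, can be illustrated by a scorer that returns the reasons for a failure alongside the verdict. This is a minimal sketch under stated assumptions: the `Criterion` type, the `score` function, and the two example checks (a trailing period and a literal `[source]` marker) are invented for illustration and are far simpler than a real rubric.

```python
from __future__ import annotations

from typing import Callable, NamedTuple


class Criterion(NamedTuple):
    """One rubric line: a name, a check, and the reason shown on failure."""
    name: str
    check: Callable[[str], bool]
    reason_if_failed: str


def score(output: str, rubric: list[Criterion]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); reasons lists every criterion that failed."""
    reasons = [f"{c.name}: {c.reason_if_failed}" for c in rubric if not c.check(output)]
    return (not reasons, reasons)


# Hypothetical rubric with two deliberately simple checks.
RUBRIC = [
    Criterion("format", lambda o: o.strip().endswith("."), "answer must end with a complete sentence"),
    Criterion("grounding", lambda o: "[source]" in o, "answer must include a source marker"),
]
```

Because every failure carries a named criterion, a reviewer can dispute a specific rubric line rather than the overall verdict.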

When it helps

Teams can compare model, prompt, retrieval, and fine-tuning options with evidence.
Launch gates can block unacceptable failures before production.
AI product improvement becomes KPI-driven rather than impression-driven.

How to use it

AI evaluation measures AI output and behavior reproducibly.
It includes prompts, data, tools, UI, and operations, not only the model.
Representative, failure, and boundary cases are needed for production confidence.
Model or prompt changes require regression tests.
Critical failure rate matters as much as average performance.

Decision cautions

Do not launch on a high average score alone.

High-impact use cases need critical-failure gates, not just averages.
Easy eval sets create false confidence.
Human review should check scorer alignment and rubric drift.
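Checking scorer alignment usually means comparing an automated scorer's pass/fail labels with human labels on the same cases. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The function name and the choice to report 1.0 when kappa is undefined are assumptions made for this example.

```python
from __future__ import annotations


def agreement_and_kappa(human: list[bool], scorer: list[bool]) -> tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two pass/fail label lists."""
    if not human or len(human) != len(scorer):
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    observed = sum(h == s for h, s in zip(human, scorer)) / n

    # Chance agreement from each rater's own pass rate.
    p_human = sum(human) / n
    p_scorer = sum(scorer) / n
    expected = p_human * p_scorer + (1 - p_human) * (1 - p_scorer)

    if expected == 1.0:
        # Both raters gave one identical label to every case, so kappa is
        # undefined; reporting 1.0 here is a choice made for this sketch.
        return observed, 1.0

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Rerunning the same comparison on a fixed sample at each review cycle is one way to notice rubric drift: falling agreement on unchanged cases suggests that the scorer or the reviewers have shifted.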

Read with

AI evaluation is the base layer for generative AI, prompting, tuning, and agents.

Term	Role	Why read together
Generative AI	Produces outputs to evaluate	Needs launch gates
Prompt Engineering	Changes need measurement	Keeps iteration grounded
AI Agent	Long tasks and tool use need evaluation	Success rate alone is insufficient

Example

A marketing team uses AI to draft campaign ideas. Initially reviewers judge outputs subjectively, so a model change quietly reduces quality. The team creates an eval set with good and bad prior examples and scores persona fit, brand tone, prohibited phrases, evidence, and CTA clarity. Every prompt change is compared on pass rate and critical failure rate. A brand-violation failure blocks release even if the average score improves. The discussion becomes evidence-based rather than opinion-based.
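The release rule in this example, where a brand violation blocks release even if the average improves, might be sketched as follows. The `DraftScore` record, the `release_decision` function, and the 0.7 pass line are illustrative assumptions; the example does not state the team's actual thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DraftScore:
    """Score for one eval case (hypothetical record shape)."""
    case_id: str
    score: float           # 0.0 to 1.0 rubric average (persona fit, brand tone, evidence, CTA clarity)
    brand_violation: bool  # prohibited phrase or other brand-rule breach


def release_decision(
    baseline: list[DraftScore],
    candidate: list[DraftScore],
    pass_line: float = 0.7,  # assumed threshold for this sketch
) -> tuple[bool, str]:
    """Decide whether a prompt change may ship, checking critical failures first."""
    if not baseline or not candidate:
        raise ValueError("both runs need at least one scored case")

    # Critical-failure gate: any brand violation blocks release outright.
    blockers = [d.case_id for d in candidate if d.brand_violation]
    if blockers:
        return False, "blocked: brand violation in " + ", ".join(blockers)

    # Only then compare pass rates against the current prompt.
    base_pass = sum(d.score >= pass_line for d in baseline) / len(baseline)
    cand_pass = sum(d.score >= pass_line for d in candidate) / len(candidate)
    if cand_pass < base_pass:
        return False, f"blocked: pass rate fell from {base_pass:.2f} to {cand_pass:.2f}"
    return True, f"release: pass rate {cand_pass:.2f} (baseline {base_pass:.2f})"
```

Checking the blocker before the pass rate is the point of the design: a higher average never gets the chance to outweigh a brand violation.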

Compare with

Method	Difference	Why read together
AI Evaluation	Tests outputs and behavior	Judges production quality
A/B Test	Compares user response	Measures real-world impact after launch
Monitoring	Tracks production behavior	Detects drift after release

Common mistakes

Evaluation is not one-and-done. Model and data changes require re-evaluation.
A few human spot checks are not enough without representative and failure cases.
A high average score can hide unacceptable critical failures.

Frequently asked questions

Is AI evaluation only model comparison?

No. It evaluates prompts, data, retrieval, tools, UI, and operating rules.

How many cases are needed?

It depends on risk. Start with representative and dangerous cases, then expand from usage logs.

Is average score enough?

No. Critical failures and prohibited behavior should be separate launch blockers.

Sources

Sources	Kind	Link
NIST: AI RMF	tier_s	Open
NIST: Generative AI Profile	tier_s	Open

Trust

Quality: Reviewed
Page tier: Reviewed article
Updated: 07/04/2026
COI: None
Sources: 2

This page is reference information for research and learning. For accounting, legal, finance, health, security, or other individual decisions, confirm against primary sources or qualified professionals.
