Fine-tuning
ファイン・チューニング
Fine-tuning adapts an existing model with additional training data so it better follows a domain, format, style, or classification policy. It should be compared against prompting and retrieval before adoption.
What it means
Fine-tuning adjusts a pretrained model with additional examples chosen for a specific task. It can improve consistent formatting, domain language, classification behavior, tone, or repeated workflow behavior. It is usually not the right tool for keeping current facts, enforcing permissions, or reading private databases at request time; retrieval and tool use often fit those needs better. A responsible fine-tuning decision requires data provenance, privacy review, bias review, baseline comparison, holdout evaluation, and rollback planning.
How to calculate it
Evaluate fine-tuning by its improvement over a baseline and by its ongoing operating cost.
| Lens | Formula / treatment | What it shows |
|---|---|---|
| Quality lift | Tuned score - baseline score | Shows whether training adds value |
| Consistency lift | Reduction in format or policy violations | Shows stability gains |
| Total cost | Training + evaluation + operations - prompt savings | Tests long-term viability |
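The three lenses in the table can be computed directly once scores and costs have been collected. The sketch below is a minimal illustration: the function names, the assumption that scores lie in [0, 1], and every number are hypothetical and are not measurements from any real system.

```python
def quality_lift(tuned_score: float, baseline_score: float) -> float:
    """Tuned score minus baseline score, both measured on the same holdout set."""
    return tuned_score - baseline_score


def consistency_lift(baseline_violations: int, tuned_violations: int,
                     total_outputs: int) -> float:
    """Reduction in the format or policy violation rate (positive is better)."""
    if total_outputs <= 0:
        raise ValueError("total_outputs must be positive")
    return (baseline_violations - tuned_violations) / total_outputs


def total_cost(training: float, evaluation: float, operations: float,
               prompt_savings: float) -> float:
    """Training + evaluation + operations - prompt savings, same currency and period."""
    return training + evaluation + operations - prompt_savings


# Illustrative inputs only (hypothetical).
print(round(quality_lift(0.91, 0.84), 2))
print(consistency_lift(40, 10, 1000))
print(total_cost(1200.0, 300.0, 500.0, 400.0))
```

A positive quality or consistency lift only supports adoption if the total cost over the same period is acceptable, so the three values are best read together.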
What counts / what does not
Fine-tuning adapts behavior. It is not a substitute for fresh knowledge retrieval or authorization.
| Item | Treatment | Why it matters |
|---|---|---|
| Include | Style, format, classification policy, domain phrasing, stable repeated tasks | Good candidates for tuning |
| Exclude | Fresh facts, database access, permissions, truth guarantees | Use retrieval or tools |
| Make explicit | Data source, eval set, failure conditions, retraining triggers | Keeps quality auditable |
What moves the number
Impact depends on data quality, evaluation design, baseline strength, and task stability.
| Driver | Metric impact |
|---|---|
| Data quality | Clean, consistent examples improve behavior |
| Evaluation | Holdout tests prevent false confidence |
| Task stability | Changing requirements increase retraining cost |
| Baseline | If prompting or RAG is enough, tuning may be unnecessary |
When it helps
- Teams can decide whether a consistency problem should be solved through training rather than prompts.
- Fresh-knowledge needs can be routed to retrieval or tools instead of being forced into training.
- Separating training and evaluation data reduces the risk of overestimating performance.
How to use it
- Fine-tuning adapts behavior; it is not a universal knowledge update mechanism.
- Try prompting, retrieval, and tool use before training when possible.
- Data quality, rights, confidentiality, and bias directly affect the tuned model.
- A baseline and holdout evaluation are required to judge success.
- Production use needs monitoring, retraining criteria, and rollback.
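One way to make the monitoring and rollback point concrete is a simple threshold check on recent production outcomes. This is a sketch under stated assumptions: `should_roll_back`, the `margin` of 0.02, and the `min_samples` of 50 are hypothetical choices, and each outcome is assumed to be a reviewer's verdict that the tuned model's output was correct.

```python
def should_roll_back(recent_outcomes: list[bool], baseline_accuracy: float,
                     margin: float = 0.02, min_samples: int = 50) -> bool:
    """Return True when the tuned model's observed accuracy falls below the
    prompt-only baseline by more than `margin` (all thresholds are assumptions)."""
    if len(recent_outcomes) < min_samples:
        return False  # too few reviewed outputs to decide either way
    observed = sum(recent_outcomes) / len(recent_outcomes)
    return observed < baseline_accuracy - margin


# Hypothetical review results: 40 correct and 10 incorrect outputs.
outcomes = [True] * 40 + [False] * 10
print(should_roll_back(outcomes, baseline_accuracy=0.90))
```

The same check could trigger retraining instead of rollback; which action applies is a policy decision that should be written down before launch.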
Decision cautions
Bad training data can make bad behavior more consistent.
- Do not mix training and evaluation data; overfitting hides production failures.
- Do not use confidential or rights-unclear data without approval.
- Re-evaluate when the base model, product policy, or data distribution changes.
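Keeping training and evaluation data apart is easier when the split is deterministic and checked for overlap. The following is a minimal sketch, not a prescribed method: it assumes each example has a stable string ID, and `assign_split`, `leaked_texts`, and the 20% holdout fraction are hypothetical names and values. It only detects exact duplicates after whitespace and case normalization, not near-duplicates.

```python
import hashlib


def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Assign an example to 'train' or 'holdout' from a stable hash of its ID,
    so the same example never moves between sets across runs."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "holdout" if bucket < holdout_fraction else "train"


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def leaked_texts(train_texts: list[str], holdout_texts: list[str]) -> set[str]:
    """Return normalized texts that appear in both sets (exact-duplicate leakage)."""
    return {_normalize(t) for t in train_texts} & {_normalize(t) for t in holdout_texts}


# Hypothetical ticket texts: the first holdout item duplicates a training item.
print(leaked_texts(["Cancel  my plan"], ["cancel my plan", "Too expensive"]))
```

Any non-empty result from `leaked_texts` means the holdout score may overstate production performance and the duplicated items should be removed from one side.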
Read with
Fine-tuning should be selected against prompting, retrieval, and evaluation alternatives.
| Related topic | Role | Why read together |
|---|---|---|
| Prompt Engineering | Improves behavior through input design | Usually the first layer to try |
| RAG | Supplies external knowledge at request time | Better for freshness and citations |
| AI Evaluation | Measures before and after | Required for launch decisions |
Example
A customer success team wants consistent classification of churn reasons into ten labels. Prompting produces inconsistent labels for similar tickets. The team considers fine-tuning with previously verified labeled tickets, removes personal data, and keeps a separate holdout set. The tuned model improves classification consistency, but it still misclassifies reasons tied to a new pricing plan. The team keeps plan information in retrieval and limits fine-tuning to stable label behavior. The launch decision is based on holdout accuracy, review effort, and a rollback path to the prompt-only baseline.
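The holdout comparison in this example can be sketched as an overall and per-label accuracy check. Everything below is illustrative: the three label names stand in for the ten labels in the scenario, and the gold labels and predictions are invented data, not results from the team described above.

```python
from collections import defaultdict


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Share of holdout examples whose predicted label matches the verified label."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def per_label_accuracy(predictions: list[str], gold: list[str]) -> dict[str, float]:
    """Accuracy broken down by verified label, to expose weak labels."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for p, g in zip(predictions, gold):
        total[g] += 1
        correct[g] += int(p == g)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical holdout set with three of the labels.
gold = ["price", "price", "bugs", "bugs", "new_plan", "new_plan"]
baseline = ["price", "bugs", "bugs", "price", "price", "new_plan"]
tuned = ["price", "price", "bugs", "bugs", "price", "new_plan"]

print(accuracy(baseline, gold))
print(round(accuracy(tuned, gold), 2))
print(per_label_accuracy(tuned, gold))
```

A per-label breakdown is what would surface a weakness like the new pricing plan case: overall accuracy can rise while one label stays unreliable.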
Compare with
| Approach | What it does | When it is useful |
|---|---|---|
| Fine-tuning | Adapts behavior through training | Useful for consistency |
| Prompting | Controls each request | Useful when requirements change often |
| RAG | Retrieves knowledge | Useful for current facts and evidence |
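The comparison above can be read as a rough first-pass routing rule. The sketch below is a deliberately simplified heuristic, assuming three yes/no questions about the task; `suggest_approach` and its parameters are hypothetical, and in practice the approaches are often combined rather than chosen exclusively.

```python
def suggest_approach(needs_current_facts: bool,
                     requirements_change_often: bool,
                     prompts_already_reliable: bool) -> str:
    """Return a first approach to try, following the comparison table.
    This is a simplification: real systems often combine all three."""
    if needs_current_facts:
        return "RAG"  # current facts and evidence come from retrieval
    if requirements_change_often or prompts_already_reliable:
        return "Prompting"  # cheapest to change per request
    return "Fine-tuning"  # stable behavior that prompts cannot make consistent


# A stable labeling task where prompting gave inconsistent results.
print(suggest_approach(needs_current_facts=False,
                       requirements_change_often=False,
                       prompts_already_reliable=False))
```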
Common mistakes
- Fine-tuning should not be used to memorize all internal knowledge. Fresh and permissioned facts need retrieval or tools.
- More data is not always better. Inconsistent examples can reduce quality.
- A tuned model still requires evaluation and monitoring.
Frequently asked questions
Does fine-tuning replace RAG?
Usually no. Use RAG or tools for fresh, source-grounded, or permissioned information.
When is it worth trying?
When prompts do not make a stable behavior reliable and you have clean training data plus a holdout evaluation set.
What is the main risk?
Training on noisy, confidential, biased, or rights-unclear data can make the wrong behavior consistent.