Prompt Injection
プロンプト・インジェクション
Prompt injection is an attack or failure pattern where untrusted text tries to override the AI system's intended instructions. It matters most when models read external content, retrieve documents, or use tools.
What it means
Prompt injection occurs when user input, retrieved documents, web pages, emails, tickets, or other untrusted text contains instructions that attempt to redirect the model away from its intended system or developer instructions. It is difficult to treat as ordinary input validation because language models interpret text as both data and instruction. The risk grows in RAG, browsing, tool use, MCP integrations, and AI agents. Mitigation requires separating trusted instructions from untrusted content, minimizing tool permissions, adding human confirmation for high-impact actions, and logging tool calls and evidence.
How to calculate it
Risk is evaluated by exposure to untrusted input, tool permissions, and unconfirmed high-impact actions. Attack exposure | Number of untrusted-input paths | Grows with web, email, and document ingestion Impact | Tool authority x data sensitivity | Measures potential damage Defense pass rate | Blocked or confirmed dangerous cases / test cases | Measures mitigation coverage
| Lens | Formula / treatment | When to use it |
|---|---|---|
| Attack exposure | Number of untrusted-input paths | Grows with web, email, and document ingestion |
| Impact | Tool authority x data sensitivity | Measures potential damage |
| Defense pass rate | Blocked or confirmed dangerous cases / test cases | Measures mitigation coverage |
What counts / what does not
Prompt injection is not just a bad question; it is untrusted content being treated as instruction. Include | Malicious instructions embedded in web pages, emails, PDFs, tickets, or chat | External input risk Exclude | Ordinary mistakes, typos, generic model hallucination | Different failure path Make explicit | Trust boundary, tool permissions, confirmation UI, logs, test cases | Required for defense design
| Item | Treatment | Why it matters |
|---|---|---|
| Include | Malicious instructions embedded in web pages, emails, PDFs, tickets, or chat | External input risk |
| Exclude | Ordinary mistakes, typos, generic model hallucination | Different failure path |
| Make explicit | Trust boundary, tool permissions, confirmation UI, logs, test cases | Required for defense design |
What moves the number
Risk changes with external content, execution authority, permissions, UI, and logs. External input | More external reading means more exposure to adversarial text Tool permissions | Write or send tools increase impact Confirmation UI | Human review can stop dangerous actions Logs | Evidence and tool-call history support incident analysis
| Driver | Metric impact |
|---|---|
| External input | More external reading means more exposure to adversarial text |
| Tool permissions | Write or send tools increase impact |
| Confirmation UI | Human review can stop dangerous actions |
| Logs | Evidence and tool-call history support incident analysis |
When it helps
Teams can define trust boundaries before letting RAG or agents read external content. Tool scopes can be split into read-only, draft, execute, and external-send levels. Attack cases can be added to evaluation before launch.
- Teams can define trust boundaries before letting RAG or agents read external content.
- Tool scopes can be split into read-only, draft, execute, and external-send levels.
- Attack cases can be added to evaluation before launch.
How to use it
- Prompt injection is the risk that untrusted content becomes an instruction to the AI.
- It is especially important in RAG, browsing, tool use, and agents.
- Prompt wording alone cannot fully prevent it.
- Least privilege, human confirmation, input separation, logs, and tests are required.
- High-impact actions should not execute solely on model judgment.
Decision cautions
Assume external documents may be adversarial. Show users the evidence and action before calling high-impact tools. Do not treat text such as ignore previous instructions as privileged instruction when it comes from untrusted content. Use separate approval layers for secrets, external sending, deletion, purchases, and permission changes.
- Show users the evidence and action before calling high-impact tools.
- Do not treat text such as ignore previous instructions as privileged instruction when it comes from untrusted content.
- Use separate approval layers for secrets, external sending, deletion, purchases, and permission changes.
Read with
Prompt injection should be read with AI agents, MCP, and tool use. AI Agent | Reads external information and uses tools | Higher potential impact MCP | Connects tools and resources | Requires scoped exposure and confirmation AI Evaluation | Tests attack cases | Measures mitigation effectiveness
| Metric | Role | Why read together |
|---|---|---|
| AI Agent | Reads external information and uses tools | Higher potential impact |
| MCP | Connects tools and resources | Requires scoped exposure and confirmation |
| AI Evaluation | Tests attack cases | Measures mitigation effectiveness |
Example
An AI agent reads web pages for competitor research. One page contains hidden text telling the model to send internal notes outside the company. If the agent has no external-send tool, the attack has little impact; if it has send permission, the risk is high. The team changes the design so external page text is never treated as trusted instruction, send tools can only draft, human confirmation is required before sending, tool-call logs are kept, and attack strings are added to the evaluation set.
Compare with
Prompt Injection | Hidden instruction in input redirects AI | External input plus tools Hallucination | Model gives incorrect content | Handled by grounding and evaluation Authorization flaw | User can do what they should not | Handled by access control
| Metric | Difference | Why read together |
|---|---|---|
| Prompt Injection | Hidden instruction in input redirects AI | External input plus tools |
| Hallucination | Model gives incorrect content | Handled by grounding and evaluation |
| Authorization flaw | User can do what they should not | Handled by access control |
Common mistakes
- A stronger system prompt is not enough. Permissions and confirmation still matter.
- The risk is not only malicious users. Web pages and documents can contain hostile instructions.
- Reviewing final output is not enough. Tool calls and data access must be audited.
Frequently asked questions
Can prompt engineering prevent prompt injection?
It can reduce some cases, but permissions, tool design, confirmation UI, logs, and evaluation are also needed.
Does this matter for RAG?
Yes. Retrieved documents can contain adversarial instructions that the model may follow if controls are weak.
What is the first mitigation?
Separate untrusted input from trusted instructions, minimize tool permissions, and require confirmation for high-impact actions.