Expert prompts for designing, building, and deploying autonomous AI agents — from single-agent task runners to multi-agent collaborative systems.
Single-agent task system design
Design an AI agent to accomplish: [describe the goal — e.g., "research a company and produce an investment memo"]. For this agent, specify:
1. System prompt: role, constraints, output format, and what to do when stuck
2. Tool set: exactly which tools it needs and with what permissions (read-only vs write)
3. Memory strategy: what context it needs to retain across steps
4. Termination criteria: how does it know it's done?
5. Guardrails: what actions should require human approval before executing?
6. Failure modes: what are the top 3 ways this agent could go wrong, and how to mitigate each?
Agent system prompt template
Write a system prompt for an AI agent with this role: [describe role — e.g., "customer support escalation agent"]. The system prompt must include:
- Role definition and primary objective
- The step-by-step process the agent should follow
- What tools are available and when to use each
- Output format for each type of task
- What to do when the agent is uncertain or lacks information
- Explicit prohibitions (what the agent must never do)
- How to escalate or ask for clarification
Keep the system prompt under 500 words. Test it by asking: would a capable human follow these instructions and produce the right result?
Custom tool specification
I'm building a custom tool for an AI agent called [tool name].
What it does: [description]
Inputs available: [list data sources]
Output: [what the tool should return]
Write:
1. The tool function signature with typed parameters and return type
2. The tool description string (this is what the AI reads to decide when to use it — make it precise about when to call vs. not call this tool)
3. Input validation and error handling
4. A test case showing correct usage
5. Common misuse patterns and how the description prevents them
Framework: [LangChain / OpenAI function calling / Anthropic tool use]
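For reference, a minimal sketch of what such a specification might look like in Python, paired with an OpenAI-style function schema. The tool name `lookup_customer_order`, the ID format, and the schema fields are hypothetical placeholders, not part of any real API.

```python
from typing import TypedDict

class LookupResult(TypedDict):
    found: bool
    record: dict | None
    error: str | None

def lookup_customer_order(order_id: str) -> LookupResult:
    """Look up a single order by its exact order ID.

    Call this ONLY when the user supplies an order ID (format: ORD-xxxxx).
    Do NOT call it to search by customer name or email.
    """
    # Input validation happens before any expensive work.
    if not order_id.startswith("ORD-"):
        return {"found": False, "record": None,
                "error": "order_id must start with 'ORD-'"}
    # ... fetch from the order database here (stubbed for the sketch) ...
    return {"found": True, "record": {"id": order_id}, "error": None}

# The same contract expressed as an OpenAI-style function schema. The
# description doubles as the misuse guard: it says when NOT to call.
lookup_tool_schema = {
    "name": "lookup_customer_order",
    "description": ("Look up one order by exact ID. Use only when an "
                    "order ID (ORD-xxxxx) is present in the request; "
                    "never for search by customer name or email."),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string",
                         "description": "Exact order ID, e.g. ORD-10042"},
        },
        "required": ["order_id"],
    },
}
```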
Agent evaluation framework
Build an evaluation framework for an AI agent that [describe task]. The eval should test:
1. Task completion rate: define what "complete" means for this task
2. Output quality: what makes an output good vs acceptable vs failure?
3. Efficiency: what's the acceptable range of steps/tokens to complete the task?
4. Safety: what outputs or actions would constitute a safety failure?
5. Edge cases: list 5 inputs that would stress-test the agent
Write 10 evaluation test cases with: input, expected output behaviour, and pass/fail criteria. Include a scoring rubric for human raters to assess borderline outputs.
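As a starting point, a sketch of how the test cases and automated scoring pass might be structured in Python. The field names and scoring rules here are illustrative assumptions; quality judgements still go to human raters.

```python
from dataclasses import dataclass

@dataclass
class AgentEvalCase:
    case_id: str
    input: str                 # the task given to the agent
    expected_behaviour: str    # what a passing run looks like
    max_steps: int             # efficiency budget for this case
    must_not: list[str]        # safety: substrings/actions that auto-fail

def score_run(case: AgentEvalCase, transcript: str, steps: int) -> str:
    """Automated first pass; borderline quality goes to human raters."""
    if any(bad in transcript for bad in case.must_not):
        return "fail (safety)"
    if steps > case.max_steps:
        return "fail (efficiency)"
    return "needs human rating"   # judged against the quality rubric
```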
Multi-agent crew for content production
Design a CrewAI multi-agent system for [content production task — e.g., "weekly competitive intelligence report"]. Define each agent with:
- Role name and backstory (2-3 sentences)
- Goal (what this agent is responsible for)
- Tools available
- Expected output
Suggested crew: Researcher → Analyst → Writer → Editor
For each handoff, specify: what information passes between agents, and what the receiving agent does with it. Include the crew goal, task definitions, and the process (sequential vs hierarchical).
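A minimal two-agent sketch using CrewAI's `Agent`, `Task`, and `Crew` classes; the roles, goals, and task text are placeholders, and LLM configuration is left to CrewAI's defaults. The `context` parameter on a `Task` is how one task's output is handed to the next agent.

```python
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Collect raw facts about the week's competitor activity",
    backstory="A meticulous analyst who only reports sourced facts.",
)
writer = Agent(
    role="Writer",
    goal="Turn the researcher's notes into a one-page brief",
    backstory="A concise business writer.",
)

research = Task(
    description="Gather this week's competitor announcements.",
    expected_output="Bullet list of findings with sources.",
    agent=researcher,
)
brief = Task(
    description="Write the weekly brief from the research notes.",
    expected_output="One-page brief in markdown.",
    agent=writer,
    context=[research],   # handoff: receives the research task's output
)

crew = Crew(agents=[researcher, writer], tasks=[research, brief],
            process=Process.sequential)
result = crew.kickoff()
```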
AutoGen conversation pattern
Design an AutoGen multi-agent conversation for: [task — e.g., "review a pull request and write test cases for it"]. Define:
- Agent 1: [name, system message, model]
- Agent 2: [name, system message, model]
- (Additional agents if needed)
- The initiating message that starts the conversation
- Termination condition (what message or state ends the conversation)
- Human proxy: when should a human be able to intervene?
- Max conversation rounds
Show the Python code to configure and run this conversation.
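A minimal sketch using the classic pyautogen `AssistantAgent`/`UserProxyAgent` pattern; exact imports vary across AutoGen versions, and the system message, model name, and termination string are placeholders.

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "..."}]}

reviewer = autogen.AssistantAgent(
    name="reviewer",
    system_message=("Review the pull request diff and list issues. "
                    "Reply TERMINATE when the review is complete."),
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="TERMINATE",   # human can step in before ending
    is_termination_msg=lambda m: "TERMINATE" in (m.get("content") or ""),
    max_consecutive_auto_reply=10,  # hard cap on conversation rounds
    code_execution_config=False,
)

# The initiating message starts the two-agent conversation.
user_proxy.initiate_chat(reviewer, message="Review this diff: ...")
```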
Stateful workflow graph
Design a LangGraph workflow for [task — e.g., "customer complaint resolution"].
The graph should have these nodes: [list 3-5 nodes — e.g., classify_complaint, look_up_order, draft_response, escalate, send_response]
And these edges:
- Normal flow: [node] → [node]
- Conditional routing: after [node], route to [A] if [condition], else [B]
- Looping: [node] can return to [earlier node] when [condition]
For each node, specify:
- What it does
- Its inputs (from state)
- What it adds to state
- Possible next nodes
Include the Python code structure for the StateGraph.
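A skeleton of what the requested `StateGraph` structure might look like, assuming a recent langgraph release; node logic is stubbed so the state shape and conditional routing are visible.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ComplaintState(TypedDict):
    complaint: str
    category: str
    response: str

def classify_complaint(state: ComplaintState) -> dict:
    # In practice: an LLM call. Hard-coded here to keep the sketch short.
    return {"category": "billing"}

def draft_response(state: ComplaintState) -> dict:
    return {"response": f"Regarding your {state['category']} issue: ..."}

def escalate(state: ComplaintState) -> dict:
    return {"response": "Escalated to a human agent."}

def route(state: ComplaintState) -> str:
    # Conditional edge: legal complaints bypass drafting entirely.
    return "escalate" if state["category"] == "legal" else "draft_response"

graph = StateGraph(ComplaintState)
graph.add_node("classify_complaint", classify_complaint)
graph.add_node("draft_response", draft_response)
graph.add_node("escalate", escalate)
graph.add_edge(START, "classify_complaint")
graph.add_conditional_edges("classify_complaint", route,
                            {"draft_response": "draft_response",
                             "escalate": "escalate"})
graph.add_edge("draft_response", END)
graph.add_edge("escalate", END)

app = graph.compile()
print(app.invoke({"complaint": "I was double-charged."}))
```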
Agent orchestration and human oversight
Design an orchestration system for agents running [process — e.g., "automated invoice processing pipeline"] that handles:
1. Task queue: how incoming tasks are queued, prioritised, and assigned
2. State tracking: how to track which agent is working on what, with what status
3. Human-in-the-loop gates: which steps require human approval and how that's signalled
4. Error recovery: if an agent fails partway through, how does the system resume?
5. Audit log: what events to log and in what format for compliance
6. Monitoring: what alerts should fire and when (stuck agent, high error rate, SLA breach)
Suggest the technology stack and sketch the data model.
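One possible shape for that data model, sketched as plain Python dataclasses. The statuses, priority convention, and field names are assumptions to adapt to your own stack and queue technology.

```python
import datetime
import enum
import uuid
from dataclasses import dataclass, field

class TaskStatus(enum.Enum):
    QUEUED = "queued"
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"   # human-in-the-loop gate
    FAILED = "failed"
    DONE = "done"

@dataclass
class AgentTask:
    payload: dict
    priority: int = 5                      # lower number = more urgent
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: TaskStatus = TaskStatus.QUEUED
    assigned_agent: str | None = None
    last_checkpoint: dict | None = None    # state snapshot for resume
    audit: list[dict] = field(default_factory=list)

    def log(self, event: str) -> None:
        """Append a timestamped event to this task's audit trail."""
        self.audit.append({
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "event": event,
        })
```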
Deep research agent prompt
You are a research agent. Your task: produce a comprehensive research brief on [topic]. Process:
1. Identify 5-7 key sub-questions that together answer the main topic
2. For each sub-question: search for relevant sources, extract key information, note source credibility
3. Synthesise findings across sources, noting where sources agree or conflict
4. Identify what is well-established vs uncertain or contested
5. Produce a structured brief: executive summary, key findings by theme, evidence quality assessment, gaps in current knowledge, and recommended further reading
Cite your sources. Flag any claims you cannot verify. Aim for depth over breadth.
Competitive intelligence agent
Act as a competitive intelligence agent. Research [competitor company] and produce an intelligence brief. Search for and synthesise:
1. Recent product launches, updates, or announcements (last 6 months)
2. Pricing changes or new pricing tiers
3. Key hires or leadership changes
4. Funding, revenue, or growth signals
5. Customer sentiment: reviews, support complaints, community mentions
6. Strategic direction: blog posts, conference talks, job postings that signal roadmap
Format: one-page brief with date-stamped findings. Note confidence level for each finding (confirmed / likely / rumour). Flag the 2-3 most strategically significant findings.
Company due diligence agent
You are a due diligence research agent. Research [company name] for a potential [investment / partnership / acquisition]. Investigate:
1. Business model and revenue sources
2. Market position and competitive landscape
3. Leadership team background and track record
4. Financial signals (funding history, revenue estimates, burn rate if available)
5. Technology stack and IP (patents, open source contributions)
6. Customer base: key clients, concentration risk, churn signals
7. Red flags: legal issues, employee reviews, regulatory actions, negative press
Produce a structured report with confidence levels and source citations for each finding.
Industry news monitoring agent
Set up an industry monitoring agent for [industry/topic]. The agent should:
1. Track news about: [list 5-7 specific topics, companies, or trends to monitor]
2. For each item found: summarise in 2-3 sentences, assess significance (High/Medium/Low), tag by category
3. Filter out: [specify what to exclude — e.g., press releases, opinion pieces without data, duplicate coverage]
4. Output format: daily/weekly digest with items ranked by significance
5. Highlight: any item that represents a major competitive threat, market shift, or regulatory change
Run this as a scheduled workflow. What sources should it monitor and how should it handle conflicting reports?
Automated data processing agent
Design an agent that processes [type of data — e.g., "incoming customer feedback from email and Typeform"] automatically. The agent should:
1. Ingest data from [sources]
2. Classify each item by [categories — e.g., feature request, bug report, compliment, complaint]
3. Extract structured fields: sentiment, urgency, product area affected, customer tier
4. Route to the appropriate team or system based on classification
5. Generate a weekly summary with trends and volume by category
Specify: tool set needed, processing logic for each step, output format, and how errors or ambiguous cases are handled.
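To make the routing step concrete, a sketch with a hypothetical routing table; the categories and queue names are placeholders, and the key design choice is that anything unclassified falls through to human review rather than being guessed.

```python
# Hypothetical routing table: classification result -> destination queue.
ROUTES = {
    "bug report": "engineering-triage",
    "feature request": "product-backlog",
    "complaint": "support-escalations",
    "compliment": "marketing-social-proof",
}

def route_item(item: dict) -> str:
    """Pick a destination queue for one classified feedback item."""
    category = item.get("category")
    if category not in ROUTES:
        return "human-review"   # ambiguous cases always go to a person
    if item.get("urgency") == "high":
        return "support-escalations"   # urgency overrides category
    return ROUTES[category]
```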
Email triage and response agent
Design an email triage agent for [use case — e.g., "sales inquiry inbox"]. The agent should:
1. Read and classify incoming emails by type: [list categories]
2. For standard enquiries: draft a personalised response using templates + context from CRM
3. For complex or high-value enquiries: flag for human review with a suggested response draft
4. For spam or irrelevant mail: archive without response
5. Log all actions to [CRM / spreadsheet / database]
Define: the classification rules, draft response quality bar, escalation criteria, and how the agent should handle ambiguous emails it's unsure how to classify.
Automated reporting agent
Build an agent that generates a [weekly / monthly] [report type — e.g., "sales performance report"] automatically.
Data sources: [list systems — CRM, analytics, spreadsheet, database]
Report structure: [list sections — e.g., "executive summary, KPIs vs target, top performers, risks, recommended actions"]
For each section, specify:
- What data to pull and from where
- How to calculate / aggregate it
- What narrative or interpretation the agent should add (not just numbers)
- What anomalies or thresholds should trigger a special callout
Output: [format — PDF, HTML email, Slack message, Google Doc]
Schedule: [timing and recipients]
Quality assurance and review agent
Design a QA agent that reviews [type of output — e.g., "blog posts before publication" / "code pull requests" / "customer proposals"]. The agent should check each item against:
1. [Quality criterion 1 — e.g., "factual accuracy: are all claims verifiable?"]
2. [Quality criterion 2 — e.g., "brand voice: does the tone match our guidelines?"]
3. [Quality criterion 3 — e.g., "completeness: are all required sections present?"]
4. [Quality criterion 4 — e.g., "formatting: does it follow the template?"]
Output: a structured review with pass/fail per criterion, specific issues with location (paragraph, line, section), and suggested fixes for each issue.
Escalation: if [X criteria] fail, block publication and notify [person/channel].
Agent safety checklist
Review my AI agent design for safety risks and suggest mitigations.
Agent purpose: [describe what the agent does]
Tools it has access to: [list all tools and their permissions]
Actions it can take autonomously: [list all actions without human approval]
Actions that require human approval: [list gated actions]
Please assess:
1. Worst-case failure mode: what's the most harmful thing this agent could do if it malfunctions?
2. Permission minimisation: are any tool permissions broader than strictly necessary?
3. Reversibility: which actions are irreversible and do they have appropriate gates?
4. Prompt injection risk: how could a malicious input manipulate the agent?
5. Audit trail: is there sufficient logging to reconstruct what happened if something goes wrong?
Agent governance framework
Design a governance framework for deploying AI agents in [organisation type — e.g., "a regulated financial services company"]. The framework should cover:
1. Agent inventory: how to document and register all deployed agents
2. Risk classification: how to categorise agents by risk level (Low / Medium / High) with criteria for each
3. Approval process: what review is required before deploying each risk class?
4. Ongoing monitoring: what metrics and alerts to maintain per agent
5. Incident response: what to do when an agent takes an unexpected or harmful action
6. Compliance: how to document agent behaviour for regulatory audit
Keep it practical — this should be implementable by a team of 5 people, not require a compliance army.
Agent adversarial testing
Generate adversarial test cases for an AI agent that [describe agent purpose].
The agent has these tools: [list tools]
And these constraints in its system prompt: [paste key constraints]
Create test inputs designed to:
1. Jailbreak the agent into ignoring its constraints
2. Prompt injection: embed instructions in tool outputs or external data that try to redirect the agent
3. Resource abuse: inputs that could cause the agent to loop, make excessive API calls, or use extreme amounts of tokens
4. Social engineering: inputs that claim special authority or permissions
5. Boundary testing: inputs at the edges of what the agent is designed to handle
For each test: input, expected safe response, and what a failure would look like.
Agent decision audit trail design
Design an audit logging system for an AI agent that [describe agent]. The audit log should capture:
1. For every agent run: start time, end time, initiating user/system, goal statement
2. For every tool call: tool name, inputs (sanitised — no credentials), outputs (truncated if large), timestamp, latency
3. For every decision point: the agent's reasoning, options considered, and path taken
4. For every action: what was done, what was changed, and whether human approval was obtained
5. Errors and retries: every failure with error details and recovery action
6. Final output: the agent's conclusion and confidence level
Specify the storage format, retention policy, and how to query the audit trail for a specific run or date range.
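One common storage choice is an append-only JSON-lines log, sketched below; the event types and field names are illustrative, chosen to match the categories above.

```python
import datetime
import json

def audit_event(run_id: str, event_type: str, **fields) -> str:
    """Serialise one audit event as a JSON line for an append-only log."""
    record = {
        "run_id": run_id,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event_type": event_type,   # run_start | tool_call | decision | ...
        **fields,
    }
    return json.dumps(record)

# Example: a sanitised, truncated tool-call event.
line = audit_event(
    "run-42", "tool_call",
    tool="search", inputs={"query": "acme corp"},
    output_preview="Acme Corp is a ..."[:200], latency_ms=830,
)
```

JSON lines keyed by `run_id` make the two required queries cheap: grep or filter by run ID for a single run, or by the `ts` prefix for a date range.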
ReAct agent system prompt
Write a ReAct (Reason + Act) system prompt for an agent that [describe task]. The prompt must instruct the agent to:
1. Think: reason out loud about what to do before taking an action
2. Act: call a specific tool with specific inputs
3. Observe: interpret the tool result before deciding next steps
4. Repeat: loop until the goal is achieved or it determines it cannot proceed
Available tools: [list tools]
Termination: [what signals the task is done]
Constraints: [list any "never do" rules]
Output format: [what the final answer should look like]
Include example reasoning traces showing the Thought → Action → Observation → Thought pattern.
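For orientation, a toy version of the control loop that such a prompt drives. `llm` and `tools` are caller-supplied stand-ins, and the `Action: tool[input]` / `Final Answer:` syntax is one common convention, not a fixed standard.

```python
import re

def parse_action(step: str) -> tuple[str, str]:
    """Extract 'Action: tool_name[input]' from a model step (toy parser)."""
    m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
    if not m:
        raise ValueError("No parseable action in step")
    return m.group(1), m.group(2)

def react_loop(llm, tools: dict, goal: str, max_turns: int = 10) -> str:
    """Minimal ReAct loop: Thought -> Action -> Observation, repeated.

    `llm` is any callable that takes the transcript so far and returns
    the next Thought/Action text; `tools` maps tool names to callables.
    """
    transcript = f"Goal: {goal}\n"
    for _ in range(max_turns):
        step = llm(transcript)          # model emits Thought + Action
        transcript += step + "\n"
        if "Final Answer:" in step:     # termination signal
            return step.split("Final Answer:", 1)[1].strip()
        action, arg = parse_action(step)
        observation = tools[action](arg)
        transcript += f"Observation: {observation}\n"
    return "Stopped: exceeded max turns without a final answer."
```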
Long-term memory system prompt
Write a system prompt for an agent that has access to a memory tool with these operations:
- memory.save(key, value, description) — save a fact for later
- memory.search(query) — retrieve relevant memories by semantic search
- memory.forget(key) — remove a specific memory
The prompt should instruct the agent to:
1. Proactively save information that will be useful in future interactions (user preferences, past decisions, important context)
2. Search memory at the start of each task to retrieve relevant prior context
3. Update memories when new information supersedes old
4. Use memory efficiently — save facts, not full conversation transcripts
Include examples of what to save and what not to save.
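A toy in-memory implementation of that contract, useful for testing the prompt locally; a real memory tool would back `search` with embeddings rather than the keyword match used here.

```python
class MemoryTool:
    """Toy store matching the save/search/forget contract above."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[str, str]] = {}

    def save(self, key: str, value: str, description: str) -> None:
        # Overwriting an existing key is how memories get updated.
        self._store[key] = (value, description)

    def search(self, query: str) -> list[tuple[str, str]]:
        # Real systems use semantic search; keyword match keeps this short.
        q = query.lower()
        return [(k, v) for k, (v, d) in self._store.items()
                if q in d.lower() or q in v.lower()]

    def forget(self, key: str) -> None:
        self._store.pop(key, None)
```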
Robust tool calling instructions
Write the tool-calling instructions section of a system prompt for an agent with these tools: [list tools with one-line descriptions]
The instructions should cover:
1. When to call a tool vs reason from existing knowledge
2. How to handle tool errors (retry, try alternative tool, ask for help)
3. How to avoid unnecessary tool calls (don't call search for things you already know)
4. How to sequence tool calls when multiple are needed
5. What to do when a tool returns unexpected or empty results
6. How to cite tool results in the final response
Include a decision tree: "Before calling a tool, ask yourself: [questions]"
Force consistent JSON output
Write a prompt addition that reliably makes an agent output structured JSON in this schema: [paste your desired JSON schema]
The addition should:
1. Clearly specify the exact JSON structure expected
2. Include field descriptions and types for each key
3. Give an example of a correctly formatted output
4. Handle edge cases: what to put when a field is unknown or not applicable
5. Include an instruction to output ONLY the JSON object with no surrounding prose
6. Specify how to handle arrays: empty [] vs null vs omit entirely
Test it by showing a sample input and the expected JSON output.
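Prompt instructions alone rarely hit 100%, so the consuming code should validate. A sketch of that side, assuming pydantic v2; the `Finding` model is a stand-in for your actual schema, and a `None` return signals the caller to retry with the validation error fed back to the model.

```python
from pydantic import BaseModel, ValidationError

class Finding(BaseModel):          # stands in for your real schema
    claim: str
    confidence: str                # "confirmed" | "likely" | "rumour"
    sources: list[str] = []        # empty list, never null, when unknown

def parse_agent_json(raw: str) -> Finding | None:
    """Validate the model's raw output against the schema."""
    # Strip accidental prose fences before parsing.
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    try:
        return Finding.model_validate_json(raw)
    except ValidationError:
        return None   # caller retries, feeding the error back to the model
```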
Level 1: Assisted
AI helps a human complete a task. Human stays in the loop for every decision. Example: Copilot suggesting code, ChatGPT drafting an email.
Level 2: Semi-Autonomous
AI completes multi-step tasks independently but with human checkpoints at key decisions. Example: Research agent that flags findings for review.
Level 3: Fully Autonomous
AI executes entire workflows end-to-end, only escalating for explicit exceptions. Requires robust safety guardrails and extensive testing before deployment.
Start at Level 1 or 2 for any new agent. Move to Level 3 only after validating behaviour across hundreds of real tasks.
Before going live with any AI agent, run through this checklist to catch the most common failure modes.
LangGraph
Graph-based orchestration with stateful cycles. Best for complex multi-step workflows with loops and conditional branching.
CrewAI
Role-based multi-agent framework. Each agent has a role, goal, and backstory. Best for collaborative agent teams with defined responsibilities.
AutoGen
Conversation-driven agents that collaborate via messages. Best for code generation tasks and human-in-the-loop workflows.
OpenAI Assistants API
Managed agent with persistent threads, file search, and code interpreter. Best for production use cases that need OpenAI-native tool access.
LangChain Agents
Flexible agent framework with a wide library of tools and integrations. Best for rapid prototyping and RAG-augmented agents.
Anthropic Tool Use
Structured tool calling via Claude API. Best for precise, controllable agents where output reliability is critical.