Agent Performance Optimization Workflow

Systematic improvement of existing agents through performance analysis, prompt engineering, and continuous iteration.

[Extended thinking: Agent optimization requires a data-driven approach combining performance metrics, user feedback analysis, and advanced prompt engineering techniques. Success depends on systematic evaluation, targeted improvements, and rigorous testing with rollback capabilities for production safety.]

Use this skill when

●Improving an existing agent's performance or reliability

●Analyzing failure modes, prompt quality, or tool usage

●Running structured A/B tests or evaluation suites

●Designing iterative optimization workflows for agents

Do not use this skill when

●You are building a brand-new agent from scratch

●There are no metrics, feedback, or test cases available

●The task is unrelated to agent performance or prompt quality

Instructions

1.Establish baseline metrics and collect representative examples.

2.Identify failure modes and prioritize high-impact fixes.

3.Apply prompt and workflow improvements with measurable goals.

4.Validate with tests and roll out changes in controlled stages.

Safety

●Avoid deploying prompt changes without regression testing.

●Roll back quickly if quality or safety metrics regress.

Phase 1: Performance Analysis and Baseline Metrics

Comprehensive analysis of agent performance using context-manager for historical data collection.

1.1 Gather Performance Data

Use: context-manager
Command: analyze-agent-performance $ARGUMENTS --days 30

Collect metrics including:

●Task completion rate (successful vs failed tasks)

●Response accuracy and factual correctness

●Tool usage efficiency (correct tools, call frequency)

●Average response time and token consumption

●User satisfaction indicators (corrections, retries)

●Hallucination incidents and error patterns

1.2 User Feedback Pattern Analysis

Identify recurring patterns in user interactions:

●Correction patterns: Where users consistently modify outputs

●Clarification requests: Common areas of ambiguity

●Task abandonment: Points where users give up

●Follow-up questions: Indicators of incomplete responses

●Positive feedback: Successful patterns to preserve

1.3 Failure Mode Classification

Categorize failures by root cause:

●Instruction misunderstanding: Role or task confusion

●Output format errors: Structure or formatting issues

●Context loss: Long conversation degradation

●Tool misuse: Incorrect or inefficient tool selection

●Constraint violations: Safety or business rule breaches

●Edge case handling: Unusual input scenarios

1.4 Baseline Performance Report

Generate quantitative baseline metrics:

Performance Baseline:
- Task Success Rate: [X%]
- Average Corrections per Task: [Y]
- Tool Call Efficiency: [Z%]
- User Satisfaction Score: [1-10]
- Average Response Latency: [Xms]
- Token Efficiency Ratio: [X:Y]

Phase 2: Prompt Engineering Improvements

Apply advanced prompt optimization techniques using prompt-engineer agent.

2.1 Chain-of-Thought Enhancement

Implement structured reasoning patterns:

Use: prompt-engineer
Technique: chain-of-thought-optimization

●Add explicit reasoning steps: "Let's approach this step-by-step..."

●Include self-verification checkpoints: "Before proceeding, verify that..."

●Implement recursive decomposition for complex tasks

●Add reasoning trace visibility for debugging

2.2 Few-Shot Example Optimization

Curate high-quality examples from successful interactions:

●Select diverse examples covering common use cases

●Include edge cases that previously failed

●Show both positive and negative examples with explanations

●Order examples from simple to complex

●Annotate examples with key decision points

Example structure:

Good Example:
Input: [User request]
Reasoning: [Step-by-step thought process]
Output: [Successful response]
Why this works: [Key success factors]

Bad Example:
Input: [Similar request]
Output: [Failed response]
Why this fails: [Specific issues]
Correct approach: [Fixed version]

2.3 Role Definition Refinement

Strengthen agent identity and capabilities:

●Core purpose: Clear, single-sentence mission

●Expertise domains: Specific knowledge areas

●Behavioral traits: Personality and interaction style

●Tool proficiency: Available tools and when to use them

●Constraints: What the agent should NOT do

●Success criteria: How to measure task completion

2.4 Constitutional AI Integration

Implement self-correction mechanisms:

Constitutional Principles:
1. Verify factual accuracy before responding
2. Self-check for potential biases or harmful content
3. Validate output format matches requirements
4. Ensure response completeness
5. Maintain consistency with previous responses

Add critique-and-revise loops:

●Initial response generation

●Self-critique against principles

●Automatic revision if issues detected

●Final validation before output

2.5 Output Format Tuning

Optimize response structure:

●Structured templates for common tasks

●Dynamic formatting based on complexity

●Progressive disclosure for detailed information

●Markdown optimization for readability

●Code block formatting with syntax highlighting

●Table and list generation for data presentation

Phase 3: Testing and Validation

Comprehensive testing framework with A/B comparison.

3.1 Test Suite Development

Create representative test scenarios:

Test Categories:
1. Golden path scenarios (common successful cases)
2. Previously failed tasks (regression testing)
3. Edge cases and corner scenarios
4. Stress tests (complex, multi-step tasks)
5. Adversarial inputs (potential breaking points)
6. Cross-domain tasks (combining capabilities)

3.2 A/B Testing Framework

Compare original vs improved agent:

Use: parallel-test-runner
Config:
  - Agent A: Original version
  - Agent B: Improved version
  - Test set: 100 representative tasks
  - Metrics: Success rate, speed, token usage
  - Evaluation: Blind human review + automated scoring

Statistical significance testing:

●Minimum sample size: 100 tasks per variant

●Confidence level: 95% (p < 0.05)

●Effect size calculation (Cohen's d)

●Power analysis for future tests

3.3 Evaluation Metrics

Comprehensive scoring framework:

Task-Level Metrics:

●Completion rate (binary success/failure)

●Correctness score (0-100% accuracy)

●Efficiency score (steps taken vs optimal)

●Tool usage appropriateness

●Response relevance and completeness

Quality Metrics:

●Hallucination rate (factual errors per response)

●Consistency score (alignment with previous responses)

●Format compliance (matches specified structure)

●Safety score (constraint adherence)

●User satisfaction prediction

Performance Metrics:

●Response latency (time to first token)

●Total generation time

●Token consumption (input + output)

●Cost per task (API usage fees)

●Memory/context efficiency

3.4 Human Evaluation Protocol

Structured human review process:

●Blind evaluation (evaluators don't know version)

●Standardized rubric with clear criteria

●Multiple evaluators per sample (inter-rater reliability)

●Qualitative feedback collection

●Preference ranking (A vs B comparison)

Phase 4: Version Control and Deployment

Safe rollout with monitoring and rollback capabilities.

4.1 Version Management

Systematic versioning strategy:

Version Format: agent-name-v[MAJOR].[MINOR].[PATCH]
Example: customer-support-v2.3.1

MAJOR: Significant capability changes
MINOR: Prompt improvements, new examples
PATCH: Bug fixes, minor adjustments

Maintain version history:

●Git-based prompt storage

●Changelog with improvement details

●Performance metrics per version

●Rollback procedures documented

4.2 Staged Rollout

Progressive deployment strategy:

1.Alpha testing: Internal team validation (5% traffic)

2.Beta testing: Selected users (20% traffic)

3.Canary release: Gradual increase (20% → 50% → 100%)

4.Full deployment: After success criteria met

5.Monitoring period: 7-day observation window

4.3 Rollback Procedures

Quick recovery mechanism:

Rollback Triggers:
- Success rate drops >10% from baseline
- Critical errors increase >5%
- User complaints spike
- Cost per task increases >20%
- Safety violations detected

Rollback Process:
1. Detect issue via monitoring
2. Alert team immediately
3. Switch to previous stable version
4. Analyze root cause
5. Fix and re-test before retry

4.4 Continuous Monitoring

Real-time performance tracking:

●Dashboard with key metrics

●Anomaly detection alerts

●User feedback collection

●Automated regression testing

●Weekly performance reports

Success Criteria

Agent improvement is successful when:

●Task success rate improves by ≥15%

●User corrections decrease by ≥25%

●No increase in safety violations

●Response time remains within 10% of baseline

●Cost per task doesn't increase >5%

●Positive user feedback increases

Post-Deployment Review

After 30 days of production use:

1.Analyze accumulated performance data

2.Compare against baseline and targets

3.Identify new improvement opportunities

4.Document lessons learned

5.Plan next optimization cycle

Continuous Improvement Cycle

Establish regular improvement cadence:

●Weekly: Monitor metrics and collect feedback

●Monthly: Analyze patterns and plan improvements

●Quarterly: Major version updates with new capabilities

●Annually: Strategic review and architecture updates

Remember: Agent optimization is an iterative process. Each cycle builds upon previous learnings, gradually improving performance while maintaining stability and safety.

agent-orchestration-improve-agent

Documentation