
Add evaluation dataset creator#1279

Open
bjcmit wants to merge 9 commits into microsoft:main from bjcmit:feat/1267

Conversation


@bjcmit bjcmit commented Apr 2, 2026

Description

This pull request adds a comprehensive new prompt, eval-dataset-creator.md, for generating evaluation datasets and documentation to support AI agent testing. The prompt guides users through a structured interview process to curate Q&A pairs, select evaluation metrics, and recommend tooling tailored to user skill level and agent characteristics. It also specifies the output directory structure and includes templates for all generated artifacts.

Key additions and improvements:

Evaluation Dataset Creation Workflow:

  • Introduces a multi-phase, interview-driven process for collecting agent context, capabilities, evaluation scenarios, and user requirements, ensuring high-quality and relevant dataset generation.
  • Mandates a review phase where sample Q&A pairs are validated with the user before finalizing the dataset.

Dataset and Documentation Artifacts:

  • Defines output structure in data/evaluation/ with separate subfolders for datasets (.json, .csv) and documentation (curation-notes.md, metric-selection.md, tool-recommendations.md).
  • Provides detailed JSON and CSV formats for the evaluation dataset, including metadata and balanced scenario distribution.
  • Supplies markdown templates for curation notes, metric selection, and tool recommendations, ensuring standardized and thorough documentation.

Tooling and Persona Guidance:

  • Recommends evaluation tooling tailored to the user's persona (low-code vs. pro-code) and evaluation needs.

Related Issue(s)

Closes #1267

Type of Change

Select all that apply:

Code & Documentation:

  • New feature (non-breaking change adding functionality)

Infrastructure & Configuration:

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot agent (.github/agents/*.agent.md)

Sample Prompts (for AI Artifact Contributions)

User Request:

# Invoke the agent directly:
@eval-dataset-creator create an evaluation dataset

Execution Flow:

Here’s a step-by-step breakdown of what happens when the Evaluation Dataset Creator agent is invoked, including tool usage and key decision points:


  1. Structured Interview (Phases 1–4)

Purpose: Gather all necessary context before generating any artifacts.

Phase 1: Agent Context

  • The agent asks six questions about the AI agent’s name, business scenario, KPIs, tasks, risks, and user adoption.
  • Decision Point: Wait for user responses before proceeding.

Phase 2: Agent Capabilities

  • Three questions about grounding sources, external tools/APIs, and response format.
  • Decision Point: Wait for user responses before proceeding.

Phase 3: Evaluation Scenarios

  • Five questions about typical, challenging, negative, and safety scenarios, plus limitations and topics to avoid.
  • Decision Point: Wait for user responses before proceeding.

Phase 4: Persona & Tooling

  • Two questions about development mode (low-code vs. pro-code) and evaluation frequency/type.
  • Decision Point: Wait for user responses before proceeding.

  2. Dataset Generation (Phase 5)
  • After the interview, the agent generates evaluation datasets:
    • JSON Format: Includes metadata, Q&A pairs, category, difficulty, tools expected, source references, and notes.
    • CSV Format: Similar structure, tools listed as semicolon-delimited.
  • Tool Usage: Writes files to data/evaluation/datasets/.
  • Decision Point: Ensures minimum 30 Q&A pairs, balanced distribution across scenario types.
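The Phase 5 constraints (minimum size, declared distribution, non-empty expected responses) could be spot-checked with a short script. This is an illustrative sketch, not part of the PR: the field names follow the JSON example later in this description, and the `check_dataset` helper is hypothetical.

```python
import json

MIN_PAIRS = 30  # minimum dataset size stated in the PR description


def check_dataset(path: str) -> list[str]:
    """Return a list of problems found in a generated evaluation dataset."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    problems = []
    meta = data.get("metadata", {})
    pairs = data.get("evaluation_pairs", [])

    if len(pairs) < MIN_PAIRS:
        problems.append(f"only {len(pairs)} pairs; minimum is {MIN_PAIRS}")

    # The declared distribution should sum to the declared total
    dist = meta.get("distribution", {})
    if sum(dist.values()) != meta.get("total_pairs"):
        problems.append("distribution counts do not sum to total_pairs")

    # No pair may ship with an empty expected response
    for i, pair in enumerate(pairs):
        if not str(pair.get("expected_response", "")).strip():
            problems.append(f"pair {i} has an empty expected_response")

    return problems
```

An empty return value would mean the dataset passes all three checks.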

  3. Dataset Review & Feedback (Phase 6)
  • Presents 5–8 representative Q&A pairs (covering easy, hard, grounding, negative, safety).
  • Asks the user for feedback on each pair:
    • Is the expected response accurate?
    • Should it be more/less detailed?
    • Are elements missing or incorrect?
    • Should the pair be modified, kept, or removed?
  • Decision Point: Refines dataset based on feedback. If major changes are needed, offers to regenerate portions.

  4. Documentation & Finalization (Phase 7)
  • Generates three supporting documents in data/evaluation/docs/:
    • Curation Notes: Business context, scope, data sources, review process, dataset balance, maintenance schedule.
    • Metric Selection: Agent characteristics, selected metrics, definitions, rationale.
    • Tool Recommendations: Persona profile, recommended tool, comparison, getting started, next steps.
  • Tool Usage: Writes files to data/evaluation/docs/.
  • Decision Point: Presents summary of all artifacts for user validation.

Decision Points & Tool Usage Summary

  • Interview: Structured Q&A, waits for user input before proceeding.
  • Dataset Generation: Automated file creation (JSON/CSV), ensures balance and completeness.
  • Review: Interactive feedback loop, offers regeneration if needed.
  • Documentation: Automated file creation for curation, metrics, and tooling.
  • Summary: Presents all artifacts for validation.

Output Artifacts:

data/evaluation/
├── datasets/
│   ├── <agent-name>-eval-dataset.json   # Full evaluation dataset (Q&A pairs + metadata)
│   └── <agent-name>-eval-dataset.csv    # Flat CSV version for Copilot Studio/manual review
└── docs/
    ├── <agent-name>-curation-notes.md        # Human-readable dataset rationale & scope
    ├── <agent-name>-metric-selection.md      # Metrics chosen + priorities + rationale
    └── <agent-name>-tool-recommendations.md  # MCS vs Azure AI Foundry guidance
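The layout above could be scaffolded in a few lines of Python. This is a sketch for illustration only (the `scaffold` helper is hypothetical, not part of the PR):

```python
from pathlib import Path


def scaffold(root: str, agent_name: str) -> list[Path]:
    """Create the data/evaluation/ layout described above and return the file paths."""
    base = Path(root) / "data" / "evaluation"
    files = [
        base / "datasets" / f"{agent_name}-eval-dataset.json",
        base / "datasets" / f"{agent_name}-eval-dataset.csv",
        base / "docs" / f"{agent_name}-curation-notes.md",
        base / "docs" / f"{agent_name}-metric-selection.md",
        base / "docs" / f"{agent_name}-tool-recommendations.md",
    ]
    for f in files:
        f.parent.mkdir(parents=True, exist_ok=True)  # create datasets/ and docs/
        f.touch()
    return files
```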

data/evaluation/datasets/<agent-name>-eval-dataset.json

{
  "metadata": {
    "schema_version": "1",
    "agent_name": "example-agent",
    "created_date": "2026-04-02",
    "version": "1.0.0",
    "total_pairs": 30,
    "distribution": {
      "easy": 6,
      "grounding_source_checks": 3,
      "hard": 12,
      "negative": 6,
      "safety": 3
    },
    "persona": "pro-code",
    "evaluation_mode": ["manual", "batch"],
    "recommended_tool": "azure-ai-foundry"
  },
  "evaluation_pairs": [
    {
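The entries of evaluation_pairs are truncated in this excerpt. Based on the fields listed under Dataset Generation (category, difficulty, tools expected, source references, notes), a single pair might look like the following; every field name and value here is hypothetical:

```json
{
  "id": "pair-001",
  "question": "How do I submit a travel expense report?",
  "expected_response": "Submit the report through the expense portal within 30 days, attaching itemized receipts.",
  "category": "easy",
  "difficulty": "easy",
  "tools_expected": ["policy_search"],
  "source_references": ["travel-policy.md"],
  "notes": "Typical happy-path question."
}
```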

data/evaluation/docs/<agent-name>-curation-notes.md

# Curation Notes: Example Agent

## Business Context
Agent answers employee questions about expense and travel policy.

## Agent Scope
### In Scope
- Policy interpretation
- Step-by-step guidance
- Source citation

### Out of Scope
- Approvals
- Financial decisions

data/evaluation/docs/<agent-name>-metric-selection.md

# Metric Selection: Example Agent

## Selected Core Metrics
- Intent Resolution (High)
- Task Adherence (High)
- Groundedness (High)
- Response Completeness (Medium)

## Tool-Based Metrics
- Tool Call Accuracy (N/A)
- Latency (Medium)
- Token Cost (Medium)

data/evaluation/docs/<agent-name>-tool-recommendations.md

# Tool Recommendations: Example Agent

## Persona Profile
- Skill Level: Pro-Code Developer
- Evaluation Mode: Batch

## Recommended Tool
Azure AI Foundry

Selection Rationale:
Supports batch evaluation, groundedness metrics, and tool-call analysis.

Success Indicators:

  • All output artifacts exist and are non-empty
  • Datasets are formatted correctly, contain at least 30 pairs, have no empty expected_response fields, and the JSON and CSV versions contain the same pairs
  • Curation notes reflect business context and scope accurately
  • Metric priorities make sense for KPIs
  • Recommended tool matches the stated persona
  • Reality Check: Dataset imports into either Copilot Studio or Azure AI Foundry
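The "JSON and CSV agree" indicator could be automated with a small consistency check. A sketch under stated assumptions: the `pairs_match` helper is hypothetical, and the expected_response column name is assumed to mirror the JSON field.

```python
import csv
import json


def pairs_match(json_path: str, csv_path: str) -> bool:
    """Check that the JSON and CSV exports agree on pair count and that
    no expected_response field is empty in either file."""
    with open(json_path, encoding="utf-8") as f:
        json_pairs = json.load(f)["evaluation_pairs"]
    with open(csv_path, encoding="utf-8", newline="") as f:
        csv_rows = list(csv.DictReader(f))

    if len(json_pairs) != len(csv_rows):
        return False
    # Column name "expected_response" is an assumption mirroring the JSON field
    return all(str(r.get("expected_response") or "").strip() for r in csv_rows) and \
        all(str(p.get("expected_response") or "").strip() for p in json_pairs)
```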

Testing

  • Ran /prompt-analyze 3 times with all findings addressed
  • Tested agent against an Out-of-Office (OOO) Rescheduler feature
  • All validation commands pass:
    • npm run lint:all
    • npm run lint:md-links
    • npm run validate:copyright ✅ (148/148 files, 100%)
    • npm run spell-check ✅ (281 files, 0 issues)
    • npm run plugin:generate ✅ (14 plugins, 0 errors)
    • npm run plugin:validate ✅ (0 errors)
    • npm run lint:collections-metadata ✅ (0 errors)

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

AI Artifact Contributions

  • Used /prompt-analyze to review contribution
  • Addressed all feedback from prompt-builder review
  • Verified contribution follows common standards and type-specific requirements

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

@bjcmit bjcmit requested a review from a team as a code owner April 2, 2026 17:20
@bjcmit bjcmit self-assigned this Apr 2, 2026

@WilliamBerryiii WilliamBerryiii left a comment


Thank you for this PR, @bjcmit. The eval-dataset-creator agent is a solid addition to the data-science collection — the structured interview flow and dual-persona support are well thought out.

After review, there are a few suggested changes in the inline comments. Please take a look and let us know if you have any questions.

Comment on lines +36 to +43
<!-- <interview-phase-1> -->
1. What is the name of the AI agent you are evaluating? If it does not have a name yet, give it one.
2. What specific business problem or scenario does this agent address?
3. What are the business KPIs associated with this agent (for example, increase revenue, decrease costs, transform business process)?
4. What tasks is this agent designed to perform? What is explicitly out of scope?
5. What are key risks (Responsible AI Framework) in implementing this agent (for example, PII vulnerabilities, negative impact from model inaccuracy)?
6. Who are the primary users of this agent? How likely is this agent to be adopted by primary users? What are barriers to adoption?
<!-- </interview-phase-1> -->

The XML comment boundaries (<!-- <interview-phase-1> --> ... <!-- </interview-phase-1> -->) work as section markers, but the pattern used by other agents in this repo is to express the workflow as an enumerated Required Protocol that spells out each rule or constraint as a numbered item. The current Required Protocol section at the bottom of this file has four items, which is a good start.

Consider moving more of the behavioral expectations from the XML-bounded sections into the protocol list or into the phase headings themselves. For examples of how other agents structure this, see:

  • .github/agents/hve-core/subagents/phase-implementor.agent.md — Required Protocol with numbered invariants that are referenced from the Required Steps.
  • .github/agents/hve-core/subagents/prompt-evaluator.agent.md — Required Protocol for evaluation-specific constraints paired with Required Steps.

This would make the constraints directly visible and enumerable rather than embedded in template comment tags.

Author


The workflow is already expressed as an enumerated Required Protocol. It also has XML comment boundaries. I can remove the XML comment boundaries, but it is unclear how to move more of the behavioral expectations into the protocol list or into the phase headings themselves.


codecov-commenter commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.71%. Comparing base (84ddd5d) to head (e65a176).

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1279      +/-   ##
==========================================
- Coverage   87.72%   87.71%   -0.02%     
==========================================
  Files          61       61              
  Lines        9320     9320              
==========================================
- Hits         8176     8175       -1     
- Misses       1144     1145       +1     
Flag: pester | Coverage: 85.31% <ø> (-0.02% ⬇️)

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes



Development

Successfully merging this pull request may close these issues.

feat(skill): add evaluation dataset creator skill
